CHAPTER 1

INTRODUCTION  AND  LITERATURE  SURVEY 

1.1  Introduction 

Distributed  systems  are  often  used  to  access  remote  shared  resources.  These  systems  offer  high 
availability  and  reliability  in  addition  to  speed.  A  component  failure  in  a  distributed  system  leads  only  to 
performance  degradation,  while  a  component  failure  in  a  single  processor  system  is  often  catastrophic. 

For  example,  the  files  in  a  distributed  system  usually  have  several  replicas  at  various  sites.  If  a  site 
crashes,  then  copies  of  the  files  are  available  at  other  sites.  Furthermore,  the  nonfaulty  sites  can  ignore 
the  failed  site  until  it  is  repaired.  Thus,  distributed  systems  are  particularly  suited  to  applications 
requiring  availability  in  adverse,  stressful  environments. 

Distributed  systems  are  either  synchronous  or  asynchronous.  The  clocks  of  the  processors  in  a 
synchronous  system  are  synchronized  so  that  all  processors  execute  each  instruction  within  some  fixed 
time  interval.  On  the  other  hand,  the  clocks  of  the  processors  in  an  asynchronous  system  are  not 
guaranteed  to  be  synchronized.  To  solve  a  given  problem,  the  processors  in  a  distributed  system  must  be 
able  to  communicate  with  one  another.  In  shared-memory  systems,  the  processors  communicate  through 
shared-memory  cells.  These  cells  may  reside  at  a  central  processor  that  offers  virtual  shared-memory 
services.  Shared-memory  systems  are  analogous  to  PRAMs  in  parallel  systems.  In  message-passing 
systems,  the  processors  communicate  by  passing  messages  to  one  another.  Such  systems  are  analogous  to 
fixed  interconnection  networks  in  parallel  systems. 

Link  and  node  failures  complicate  the  design  of  distributed  algorithms.  Failures  may  occur  before 
or  during  the  execution  of  the  algorithms,  and  hence  data  and  messages  may  be  lost.  There  are  several 
types  of  failure.  For  example,  failures  can  be  fail-stop,  Byzantine,  or  intermittent.  The  fail-stop  failure  is 
the  most  benign  of  the  failure  types;  a  failed  node  stops  sending  messages,  and  a  failed  link  stops 
transmitting  messages.  In  the  fail-stop  failure,  once  a  node  or  a  link  stops  sending  messages,  it  never 
sends  another  message.  On  the  other  hand,  the  Byzantine  failure  is  one  of  the  most  malicious  of  the 
failure types; nodes or links fail by altering messages, sending false information, and so on. The
intermittent  failure  is  more  malicious  than  fail-stop  failure  but  less  malicious  than  Byzantine  failure; 
nodes  or  links  fail  by  losing  messages  at  will.  Schlichting  and  Schneider  [46]  show  how  to  design  a 
network  in  which  processors  fail  only  by  stopping. 

To  deal  with  faults,  researchers  either  design  algorithms  that  try  to  detect  faults,  or  assume  that  the 
faults  are  undetectable  and  design  enough  message-redundancy  in  the  algorithms  to  tolerate  the  failures. 
Preparata,  Metze,  and  Chien  [44]  formulate  necessary  and  sufficient  conditions  for  automatic  fault 
detection  in  systems  with  multiple  faults.  Unfortunately,  fault  detection  is  not  practical  in  asynchronous 
systems  since  it  forces  the  nodes  to  distinguish  between  very  slow  links  and  faulty  links.  Furthermore, 
fault  detection  wastes  valuable  time  in  synchronous  systems.  Therefore,  it  is  desirable  to  design  fault- 
tolerant  distributed  algorithms  that  do  not  rely  on  fault  detection. 

In this thesis, we solve some important problems often encountered in the design of fault-tolerant
distributed algorithms: process agreement and election.

A  fundamental  problem  of  distributed  computation  is  the  construction  of  protocols  to  reach 
agreement  among  the  processes  in  a  distributed  computer  system.  Agreement  appears  in  the  design  of 
mutual  exclusion  algorithms  for  processes  that  compete  for  a  shared  resource.  By  communicating  among 
themselves,  the  processes  agree  on  which  process  gains  access  to  the  resource.  Agreement  also  appears  in 
transaction commitment protocols for distributed databases. A transaction may invoke several processes
(sometimes called the agents of the transaction) to access data records. To commit the transaction, all
processes must agree to write the new values of the records.

Election is the problem of choosing a unique processor as the leader of a network of processors.
The processors are identical except that each processor has a unique identifier chosen from a totally
ordered set. Initially, no processor knows the identifier of any other processor. Hence, the processors
cannot elect a predetermined leader. Two adjacent processors communicate by sending messages to each
other on the communication link connecting them. Two nonadjacent processors communicate through
processors adjacent to both of them. The election problem occurs, for instance, when a processor must be
selected  to  replace  a  malfunctioning  central  lock  coordinator  in  a  distributed  database  system,  or  to 
replace  a  primary  site  in  a  replicated  distributed  file  system  [5],  [23],  [37].  Election  also  occurs  in 
token-passing  systems.  The  processors  in  these  systems  pass  a  unique  token  among  themselves.  The 
processor  that  receives  the  token  executes  the  processor’s  algorithm  and  then  sends  the  token  to  some 
other  processor.  If  the  token  is  lost,  then  the  processors  elect  a  leader  to  issue  a  new  token. 

1.2 Literature Survey

1.2.1 Process agreement

Protocols  for  agreement  depend  on  the  characteristics  of  the  distributed  system.  The  system  may  be 
synchronous  or  asynchronous.  The  processes  may  communicate  by  passing  messages  or  by  accessing 
cells  in  a  shared  memory.  Process  failures  may  be  Byzantine,  in  which  a  faulty  process  may 
communicate  maliciously,  or  fail-stop,  in  which  a  process  dies  without  communicating  further. 

Pease, Shostak, and Lamport [40] showed how to achieve agreement in synchronous message-passing
systems with Byzantine failures. In this situation the problem of reaching agreement is called the
Byzantine Generals Problem. Dolev et al. [15] and Dolev and Strong [17] devised more efficient
solutions for the Byzantine Generals Problem. The most efficient solution to the problem was presented
by Coan [13]. His paper also includes an extensive bibliography on the problem. Dolev, Dwork, Lynch,
and Stockmeyer [14], [18] gave conditions under which limited asynchronism can be tolerated in the
Byzantine Generals Problem.

Now consider asynchronous systems with fail-stop processes. For detectable process death, a
mutual exclusion algorithm of Burns [9] can be modified to achieve agreement with two-valued shared
memory cells, and the protocols of Schneider [47] achieve agreement on message-passing systems. For
undetectable process death, the decentralized simulation of resource managers of Jaffe [31] achieves
agreement with test-and-set operations on four-valued shared memory cells. (Actually, three-valued cells
suffice [30].) Indeed, Jaffe’s simulation uses $O(n^2)$ cells to achieve agreement an unlimited number of
times. On the other hand, Fischer, Lynch, Merrit, and Paterson [19], [20] proved that agreement cannot
be achieved in asynchronous message-passing systems with the undetectable death of even one process;
their result holds a fortiori for Byzantine failures.

All  papers  that  we  have  cited  treat  only  deterministic  protocols.  Ben-Or  [8]  designed  a  simple 
probabilistic  agreement  protocol  for  completely  asynchronous  message-passing  systems  with  Byzantine 
failures. 

1.2.2 Election

There are three common measures of the efficiency of a distributed algorithm: the maximum
number of messages sent during any execution of the algorithm, the maximum running time of the
algorithm, and the size of the messages required in the algorithm. Election algorithms should minimize
all three measures.

Several synchronous and asynchronous election algorithms have been proposed on a variety of
network topologies. Gafni [23], and Gallager, Humblet, and Spira [24] prove that election algorithms
that work for all networks require $\Theta(m + n \log n)$ messages, where m is the number of links in the
network.

Now consider election algorithms that work for specific networks. Burns [9], Frederickson and
Lynch [22], and Pachl, Korach, and Rotem [39] show that election in synchronous rings requires
$\Omega(n \log n)$ messages. Since synchronous rings are a special case of asynchronous rings, election
algorithms for asynchronous rings also require $\Omega(n \log n)$ messages. A series of increasingly efficient
election algorithms for rings appeared in the literature [11], [28], [42]. The recent algorithm by van
Leeuwen and Tan [49] works for synchronous and asynchronous rings and uses $1.44\, n \log n$ messages in
the worst case. Similarly, Afek and Gafni [4] and Peterson [41] prove that election in synchronous and
asynchronous complete networks requires $\Theta(n \log n)$ messages. Afek and Gafni show that election
algorithms for synchronous complete networks require $\Theta(n \log n)$ messages and run in $\Theta(\log n)$ time
units. Next, Afek and Gafni [4], and Peterson [41], develop several asynchronous algorithms for
complete networks that have a time complexity of $O(n)$ time units and a message complexity of
$O(n \log n)$ messages.
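
To make the message-count measure concrete, the following Python sketch simulates a simple
unidirectional ring election in which the largest identifier wins. It is an illustration only and is
not one of the algorithms cited above; the function name `ring_election` and the first-in, first-out
delivery of messages are assumptions made for this example. On a ring whose identifiers decrease in
the direction of travel it sends n(n+1)/2 messages, well above the n log n bounds discussed here.

```python
from collections import deque


def ring_election(ids):
    """Simulate a unidirectional ring election; return (leader, messages_sent).

    Assumes the identifiers in `ids` are distinct and listed in ring order.
    """
    n = len(ids)
    # Every processor starts by sending its own identifier to its successor.
    queue = deque((i, ids[i]) for i in range(n))  # (sender index, identifier)
    messages = 0
    leader = None
    while queue:
        sender, ident = queue.popleft()
        messages += 1
        receiver = (sender + 1) % n
        if ident > ids[receiver]:
            queue.append((receiver, ident))  # forward a larger identifier
        elif ident == ids[receiver]:
            leader = ident                   # own identifier travelled the whole ring
        # A smaller identifier is discarded by the receiver.
    return leader, messages
```

For example, `ring_election([3, 2, 1])` elects 3 after six messages, while `ring_election([1, 2, 3])`
needs only five.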

Election algorithms for some networks can be improved if the processors have additional
information besides the topology. For example, Loui, Matsushita, and West [35] show that $O(n)$
messages suffice for election in asynchronous complete networks if at each processor the label on each
link gives the distance, along a fixed Hamiltonian cycle, to the processor at the other end of the link.
Also, Peterson [41] proves that election in asynchronous square meshes requires only $O(n)$ messages if
the processors have a sense of direction [45], i.e., the processors have a consistent view of the north, east,
south, and west direction in the mesh. When each processor can distinguish between its left link and its
right link in a ring, Dolev, Klawe, and Rodeh [16] present an election algorithm that works for
synchronous and asynchronous rings and uses $1.356\, n \log n$ messages in the worst case.

The references we cited so far assume that the election algorithms are comparison algorithms, i.e.,
the election algorithms are restricted to use only comparisons of processor identifiers. Frederickson and
Lynch [22] prove that both the departure from the comparison model and the possibility of using a large
number of rounds are necessary in order to obtain an election algorithm of $O(n)$ message complexity for
synchronous rings. Gafni [23] presents a synchronous noncomparison election algorithm for rings. The
algorithm is of message complexity $O(n)$ and of time complexity $O(n 2^n + |T|^2)$, where $|T|$ is the
cardinality of the set of the node identifiers.

Processor  and  communication  link  failures  complicate  the  election  problem.  The  impossibility 
result  of  Fischer,  Lynch,  and  Paterson  [20]  implies  that  if  a  node  may  fail  by  stopping,  then  no  election 
algorithm  exists  for  asynchronous  networks  even  if  all  the  links  are  reliable.  On  the  other  hand,  the 
synchronous  algorithms  of  Pease,  Shostak,  and  Lamport  [40],  Dolev  et  al.  [15],  Dolev  and  Strong  [17], 
and  Coan  [13]  can  be  modified  to  obtain  election  algorithms  for  synchronous  complete  networks  with 
Byzantine  node  failures. 

Only recently have algorithms been designed for networks with faulty links. Goldreich and
Shrira [25] study election in asynchronous rings with one intermittently faulty link. If n is known to all
the nodes, then they present an algorithm that uses $O(n \log n)$ messages; otherwise, they develop an
algorithm that uses $O(n^2)$ messages. Cimet and Kumar [12] discuss an algorithm that elects a leader
when links fail detectably. The type of failure they consider is fail-stop.

1.3 Contents of the Thesis

Chapters  2,  3,  and  4  are  self-contained  papers,  except  that  all  the  references  are  grouped  together  at 
the  end  of  the  thesis.  The  reader  may  choose  to  read  any  chapter  without  referring  to  other  chapters. 

In  Chapter  2,  we  characterize  completely  the  shared-memory  requirements  for  achieving  agreement 
in  an  asynchronous  system  of  fail-stop  processes  that  die  undetectably.  This  work,  co-authored  with 
Michael  C.  Loui,  appeared  in  Advances  in  Computing  Research  [34].  Section  2.4  appeared  in  Hosame 
Hassan  Abu-Amara’s  M.S.  thesis  [3]. 

In  Chapter  3,  we  present  an  algorithm  for  election  in  asynchronous  complete  networks  when  the 
links  may  fail  intermittently  and  undetectably.  The  algorithm  appeared  in  the  IEEE  Transactions  on 
Computers  [2]. 

In  Chapter  4,  we  describe  an  efficient  algorithm  for  election  in  synchronous  square  meshes.  Also, 
we  prove  a  lower  bound  on  the  number  of  messages  any  algorithm  uses  in  synchronous  meshes. 

In  Chapter  5,  we  discuss  some  open  problems. 

CHAPTER 2

MEMORY  REQUIREMENTS  FOR  AGREEMENT 
AMONG  UNRELIABLE  ASYNCHRONOUS  PROCESSES 

2.1  Introduction 

We  study  an  asynchronous  system  of  processes  that  communicate  via  a  shared  memory  and  die 
undetectably.  Each  process  has  a  binary  input,  and  the  processes  together  must  agree  on  a  decision  that  is 
one  of  their  inputs.  We  prove  that  there  is  no  agreement  protocol  that  uses  only  read  and  write 
operations,  even  if  the  protocol  may  assume  that  at  most  one  process  dies.  Furthermore,  there  is  no 
agreement  protocol  that  uses  test-and-set  operations  if  the  memory  cells  have  only  two  values  and  two  or 
more  processes  may  die,  but  there  are  test-and-set  protocols  if  either  memory  cells  have  three  values  or  at 
most  one  process  dies.  A  table  in  Section  2.6  summarizes  our  results.  Our  results  imply  that  Jaffe’s 
simulation  [31]  cannot  be  modified  to  use  two-valued  memory  cells. 

Fischer,  Lynch,  and  Paterson  [20]  proved  that  agreement  cannot  be  achieved  in  asynchronous 
message-passing  systems  with  the  undetectable  death  of  even  one  process;  their  result  holds  a  fortiori  for 
Byzantine  failures.  There  are  three  major  differences  between  their  paper  and  ours,  however.  First, 
Fischer, Lynch, and Paterson assumed only one kind of "atomic step" similar to our test-and-set
operation;  in  one  step  a  process  receives  a  message,  changes  state,  and  broadcasts  a  message  to  all 
processes.  Besides  test-and-set  protocols,  we  study  protocols  with  read  and  write  operations.  Second, 
Fischer,  Lynch,  and  Paterson  did  not  consider  the  message  contents,  whereas  we  show  that  the  size  of  the 
set  of  values  that  cells  may  store  affects  whether  agreement  is  possible.  Third,  Fischer,  Lynch,  and 
Paterson  assumed  a  weak  communication  mechanism:  there  is  no  bound  on  the  delay  between  sending 
and  receiving  a  message.  In  contrast,  in  our  shared  memory  the  new  value  of  a  cell  becomes  available  for 
reading  immediately  after  a  process  writes  the  cell.  Dolev,  Dwork,  and  Stockmeyer  [14]  called  the 
communication  mechanism  of  Fischer,  Lynch,  and  Paterson  asynchronous  and  our  communication 
mechanism synchronous. Recently, Abrahamson [1] and Herlihy [27] generalized our results.

Section 2.2 defines process systems, agreement protocols, resilience, and computation graphs.
Section 2.3 presents three simple agreement protocols. Section 2.4 establishes the impossibility of fully
resilient agreement, Section 2.5 the impossibility of k-resilient agreement. Although the results of
Section 2.5 supersede the results of Section 2.4, their proofs are significantly more difficult. Since
Sections 2.4 and 2.5 are each self-contained, the reader may choose to read only the section of interest.
2.2 Definitions

2.2.1 Process systems

Informally,  a  process  system  is  a  set  of  asynchronously  executing  processes  in  a  computer  system. 
The  processes  communicate  via  a  shared  memory.  Any  memory  local  to  a  process  is  incorporated  into  its 
state. 

Formally,  a  process  system  S  consists  of  the  following: 

(1) A finite set of memory cells, denoted MEM(S). Let m = |MEM(S)|, and let VAL(S) be the finite
set of possible cell values. For nontriviality, $|VAL(S)| \ge 2$.

(2) A set of processes, denoted PROC(S). Let n = |PROC(S)|. Let STATES(j,S) be the finite set of
states of process j for j = 1, ..., n. STATES(i,S) and STATES(j,S) are disjoint if $i \ne j$.

(3) A cell assignment function

$$cell: \bigcup_{j=1}^{n} STATES(j,S) \to MEM(S)$$

that associates with each process state q in each STATES(j,S) the cell cell(q) in MEM(S) that
process j accesses when it is in state q.

(4) Partial process transition functions $\delta_1, \ldots, \delta_n$, with each

$$\delta_j: STATES(j,S) \times VAL(S) \to STATES(j,S) \times VAL(S).$$

$\delta_j(q,v)$ may not be defined for some process states q.


The set of system states of S is

$$SYS(S) = STATES(1,S) \times STATES(2,S) \times \cdots \times STATES(n,S) \times [VAL(S)]^m.$$

The state of process j in system state s, denoted state(j,s), is the jth coordinate. The value of cell c in
system state s is denoted value(c,s). In an initial system state all cells have the same value. The state of
a process in an initial system state is its initial state.

The process transition functions induce a system transition function

$$\Delta: \{1, 2, \ldots, n\} \times SYS(S) \to SYS(S)$$

defined as follows. For j in {1, 2, ..., n} and s in SYS(S) let q = state(j,s), c = cell(q), v = value(c,s),
and $\delta_j(q,v) = (q',v')$. Then

$$state(j, \Delta(j,s)) = q', \text{ and } state(i, \Delta(j,s)) = state(i,s) \text{ for all } i \ne j;$$
$$value(c, \Delta(j,s)) = v', \text{ and } value(d, \Delta(j,s)) = value(d,s) \text{ for all } d \ne c.$$

We say that $\Delta(j,s)$ is the new system state after a transition from s by process j. The definition of
$\Delta(j,s)$ implies that only process j may change state and only cell c may change value. Moreover, the
new state of process j and the new value of c depend on no other process states or cell values.
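
The definition of the system transition function can be exercised on a toy system. The Python sketch
below is illustrative only: the tuple encoding of system states, the names `delta_step`, `cell`, and
`delta`, the zero-based process indices, and the two-process example are all assumptions made here
and are not part of the formal model.

```python
from typing import Callable, Dict, Optional, Tuple

# A system state pairs the tuple of process states with the tuple of cell values.
SystemState = Tuple[Tuple[str, ...], Tuple[int, ...]]
# A process transition function maps (state, cell value) to (new state, new value),
# or to None where it is undefined.
ProcessDelta = Callable[[str, int], Optional[Tuple[str, int]]]


def delta_step(j: int, s: SystemState, cell: Dict[str, int],
               delta: Dict[int, ProcessDelta]) -> Optional[SystemState]:
    """Apply one transition by process j; return None where the step is undefined."""
    states, values = s
    q = states[j]
    if q not in cell:          # in this sketch, halted states are assigned no cell
        return None
    c = cell[q]
    result = delta[j](q, values[c])
    if result is None:
        return None
    q_new, v_new = result
    # Only process j changes state, and only cell c changes value.
    new_states = states[:j] + (q_new,) + states[j + 1:]
    new_values = values[:c] + (v_new,) + values[c + 1:]
    return (new_states, new_values)


# Hypothetical example: process 0 writes 1 into cell 0; process 1 reads cell 0.
cell = {"p0_start": 0, "p1_start": 0}
delta = {
    0: lambda q, v: ("p0_done", 1) if q == "p0_start" else None,
    1: lambda q, v: ("p1_done", v) if q == "p1_start" else None,
}
s0: SystemState = (("p0_start", "p1_start"), (0,))
s1 = delta_step(0, s0, cell, delta)
s2 = delta_step(1, s1, cell, delta)
```

Here `s0`, `s1`, `s2` form a short computation in which each step changes one process state and at
most one cell value.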

A computation is a sequence of system states

$$s_0, s_1, s_2, \ldots$$

such that $s_0$ is an initial system state, and for every i there is some process j such that
$s_{i+1} = \Delta(j, s_i)$. A computation terminates at system state $s_f$ if there is no process j for which
$\Delta(j, s_f)$ is defined. Call $s_f$ the final system state in the computation.

2.2.2  Agreement  protocols 

After  we  define  agreement  protocols  informally,  we  give  precise  definitions. 

Every process in S has an input in {0,1}. An agreement protocol $\Pi$ for S is a protocol such that at
the end of every computation induced by $\Pi$ the processes must agree on a common decision that is the
input of one of the processes. Furthermore, $\Pi$ has these properties:

(1) every computation induced by $\Pi$ terminates;

(2) $\Pi$ tolerates undetectable process death;

(3) the processes communicate only through the shared memory.

All processes must eventually agree on either 0 or 1. The processes cannot agree on a fixed value such as
0. For example, if all inputs are 1, then all decisions must be 1.

Formally, for every process j, STATES(j,S) has five special states $B_j^0$, $B_j^1$, $E_j^0$, $E_j^1$, and $D_j$. The
initial state of process j is either $B_j^0$ or $B_j^1$. The input of process j is 0 (1) if its initial state is
$B_j^0$ ($B_j^1$). The decision of process j is 0 (1) if $state(j, s_f) = E_j^0$ ($E_j^1$). Process j is dead in system
state s if $state(j,s) = D_j$. If process j is not dead in system state s, then it is live in s. Section 2.2.3
will discuss dead processes further. Once process j enters $E_j^0$ or $E_j^1$ or $D_j$, it makes no further
transitions; $\delta_j(E_j^0, v)$, $\delta_j(E_j^1, v)$, and $\delta_j(D_j, v)$ are undefined for all cell values v.

An agreement protocol $\Pi$ for S is a specification of the process states and process transition
functions of S such that

(1) every computation induced by $\Pi$ terminates;

(2) for every computation induced by $\Pi$ the decisions of all live processes are the same;

(3) if some process in the final system state $s_f$ is live, then its decision is the input of one of the
processes.

Note that $\Pi$ must necessarily tolerate undetectable process death because for every j, when process j
makes a transition, it does not inspect the state of any other process.

Now we define two kinds of agreement protocols: read/write protocols and test-and-set protocols.
To simplify the notation assume that VAL(S) = {0,1}. It is straightforward to modify the following
definitions for |VAL(S)| > 2.

For read/write protocols a process may atomically read or atomically write a cell. Formally, each
state in STATES(j,S) other than $E_j^0$, $E_j^1$, or $D_j$ can have two forms:

(i) $q = (j, \mathrm{READ}, c, r_0, r_1)$, where $c \in MEM(S)$ and $r_0, r_1 \in STATES(j,S)$. Here cell(q) = c.
Furthermore,

$$\delta_j(q,0) = (r_0, 0) \quad\text{and}\quad \delta_j(q,1) = (r_1, 1).$$

That is, process j reads cell c. If the value of c is 0 (1), then the next state of process j is $r_0$ ($r_1$),
and the value of c remains the same. Call q a READ state.

(ii) $q = (j, \mathrm{WRITE}, c, r, v)$, where $c \in MEM(S)$, $r \in STATES(j,S)$, and $v \in VAL(S)$. Here cell(q) = c.
Furthermore,

$$\delta_j(q,0) = \delta_j(q,1) = (r, v).$$

That is, without inspecting the current value of c, process j writes the value v into c and enters state
r. Call q a WRITE state.

For test-and-set protocols a process may read (test) a cell, and depending on its value, write (set) a
new value into the cell and change state in one atomic step. Formally, each state in STATES(j,S) other
than $E_j^0$, $E_j^1$, or $D_j$ has the form

$$q = (j, c, r_0, r_1, v_0, v_1),$$

where $c \in MEM(S)$, $r_0, r_1 \in STATES(j,S)$, and $v_0, v_1 \in VAL(S)$. Here c = cell(q). Furthermore,

$$\delta_j(q,0) = (r_0, v_0) \quad\text{and}\quad \delta_j(q,1) = (r_1, v_1).$$

That is, if the value of c is 0 (1), then the next state of process j is $r_0$ ($r_1$) and the new value of c is
$v_0$ ($v_1$). With a test-and-set operation one can implement both a read operation (by setting $v_0 = 0$ and
$v_1 = 1$) and a write operation (by setting $r_0 = r_1$ and $v_0 = v_1$).

This test-and-set operation was popularized by Lynch and Fischer [36]. In the traditional, weaker,
test-and-set operation (Peterson and Silberschatz [43]) the process must set the cell to the same value,
regardless of the cell’s old value. Our test-and-set operation can implement the weak test-and-set
operation by setting $v_0 = v_1$.

2.2.3 Process death

Section 2.2.1 defined a deterministic system transition function $\Delta$. We modify the definition to
model spontaneous process death: a process may die at any time. Define $\Delta$ to be a nondeterministic
partial function induced by the process transition functions as follows. For j in {1, 2, ..., n} and s in
SYS(S) let q = state(j,s), c = cell(q), v = value(c,s), and $\delta_j(q,v) = (q',v')$. Then

$$state(j, \Delta(j,s)) \in \{q', D_j\};$$
$$state(i, \Delta(j,s)) = state(i,s) \text{ for all } i \ne j;$$
$$value(c, \Delta(j,s)) = v' \text{ if } state(j, \Delta(j,s)) = q', \text{ but}$$
$$value(c, \Delta(j,s)) = value(c,s) \text{ if } state(j, \Delta(j,s)) = D_j; \text{ and}$$
$$value(d, \Delta(j,s)) = value(d,s) \text{ for all } d \ne c.$$

In other words, $\Delta(j,s)$ is one of two new system states. If process j dies by entering state $D_j$, then
no cell value changes. In particular, a process may die on its first transition without changing the value
of a memory cell. After a process dies, it makes no further transitions.
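
One way to picture the nondeterministic transition function is as a function that returns every
system state a step by process j might produce. The Python sketch below makes that concrete under
stated assumptions: the name `successors`, the single marker `DEAD` standing in for each dead state,
the zero-based indices, and the one-process example are all hypothetical choices for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

SystemState = Tuple[Tuple[str, ...], Tuple[int, ...]]
ProcessDelta = Callable[[str, int], Optional[Tuple[str, int]]]

DEAD = "D"  # stands in for the dead state of whichever process holds it


def successors(j: int, s: SystemState, cell: Dict[str, int],
               delta: Dict[int, ProcessDelta]) -> List[SystemState]:
    """Return the system states that a transition by process j may lead to."""
    states, values = s
    q = states[j]
    if q == DEAD or q not in cell:
        return []                      # dead or halted: no further transitions
    c = cell[q]
    step = delta[j](q, values[c])
    if step is None:
        return []
    q_new, v_new = step
    # Normal outcome: process j enters q_new and cell c takes the value v_new.
    normal = (states[:j] + (q_new,) + states[j + 1:],
              values[:c] + (v_new,) + values[c + 1:])
    # Death: process j enters its dead state and no cell value changes.
    died = (states[:j] + (DEAD,) + states[j + 1:], values)
    return [normal, died]


# Hypothetical one-process system: the process writes 1 into cell 0 and halts.
cell = {"start": 0}
delta = {0: lambda q, v: ("done", 1)}
s0: SystemState = (("start",), (0,))
outcomes = successors(0, s0, cell, delta)
```

In the example, `outcomes` contains the state in which the write took effect and the state in which
the process died with the cell still holding 0.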

2.2.4  Resilience 

Protocols that allow processes to wait differ subtly from protocols that always force processes to
make progress. A protocol is k-resilient if it achieves agreement in all computations with at most k
process deaths, provided that every live process eventually makes transitions. Call an (n-1)-resilient
protocol fully resilient. By definition, every k-resilient protocol is k'-resilient for all k' < k. A k-resilient
protocol could allow a process to wait for other processes to take transitions. For example, standard
mutual exclusion protocols are 0-resilient since they assume a perfectly reliable system; in these protocols
fast processes wait for slow processes that eventually make transitions. A 1-resilient protocol assumes
that at most one process will die; thus all processes can wait for one of two distinguished processes to
change the value of a cell. In contrast, in fully resilient protocols live processes must not wait for other
processes to make transitions. The simulation of Jaffe [31] requires fully resilient protocols. This
simulation assumes that a dead process is indistinguishable from a slow process.

2.2.5 Computation graphs

The  following  definitions  are  used  in  Section  2.4  and  Section  2.5. 

Let $\Pi$ be an agreement protocol, and let $s_0$ be an initial system state. Define a directed computation
graph T = (V,A) for $\Pi$ as follows. The node set V is the set of system states s such that s has no dead
processes and s is in some computation of $\Pi$ starting from some initial system state. There is an arc (s,t)
in A if and only if $\Delta(j,s) = t$ for some process j; that is, t follows from s after a transition by process j.
Label arc (s,t) with j. Call t a child of s. Node u is reachable from node s if there is a directed path from
s to u.

In  the  sequel  we  shall  not  distinguish  between  a  node  s  in  V  and  the  system  state  corresponding  to  s. 
Also,  we  shall  not  distinguish  between  an  arc  and  the  process  transition  corresponding  to  the  arc. 

A leaf of T is a node with no children. By construction, the leaves of T are system states in which
the state of every process j is in $\{E_j^0, E_j^1\}$. Since $\Pi$ is an agreement protocol, the processes at a leaf t
must have the same decision. Call t a 0-leaf (1-leaf) if the decision of the processes in system state t is
0 (1). Call a node s bivalent if there is a 0-leaf reachable from s and a 1-leaf reachable from s. Call a
node s 0-valent if every leaf reachable from s is a 0-leaf. By definition, every node reachable from a
0-valent node is 0-valent. Also, if $state(j,s) = E_j^0$ for some process j, then s is 0-valent because j will
make no further transitions, and all processes will agree with the decision of j. Similarly, call a node s
1-valent if every leaf reachable from s is a 1-leaf. Call node s univalent if it is either 0-valent or
1-valent. Every node is either bivalent or univalent.
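
The valency of a node depends only on which leaves are reachable from it, so it can be computed by a
simple traversal. The Python sketch below is an illustration under assumptions: the graph is given
as a hypothetical adjacency dictionary `children`, leaf decisions are given in `leaf_decision`, and
the graph is taken to be finite and acyclic, which holds when every computation terminates.

```python
from typing import Dict, List, Set


def reachable_decisions(node: str, children: Dict[str, List[str]],
                        leaf_decision: Dict[str, int]) -> Set[int]:
    """Collect the decisions of all leaves reachable from node (graph assumed acyclic)."""
    kids = children.get(node, [])
    if not kids:
        return {leaf_decision[node]}   # a leaf reaches only itself
    found: Set[int] = set()
    for child in kids:
        found |= reachable_decisions(child, children, leaf_decision)
    return found


def valency(node: str, children: Dict[str, List[str]],
            leaf_decision: Dict[str, int]) -> str:
    """Classify node as bivalent, 0-valent, or 1-valent."""
    decisions = reachable_decisions(node, children, leaf_decision)
    if decisions == {0, 1}:
        return "bivalent"
    return "0-valent" if decisions == {0} else "1-valent"


# Hypothetical computation graph: s has one 0-valent child and one 1-valent child.
children = {"s": ["a", "b"], "a": ["a0"], "b": ["b1"]}
leaf_decision = {"a0": 0, "b1": 1}
```

In this small graph `s` is bivalent, while `a` and `b` are univalent.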

2.3 Three Agreement Protocols

This section describes three simple test-and-set protocols that motivate the main results of Sections
2.4 and 2.5. We present the protocols informally; it is straightforward to express the protocols in terms
of process states and process transition functions.

First, let PROC(S) = {1, 2, ..., n}. We can guarantee that all processes achieve the same decision
by using one three-valued cell c. That is, VAL(S) = {0, 1, #}, and MEM(S) = {c}. Assume that the initial
value of c is # (blank). In the agreement protocol for this system the first process to take a transition sets
c to its input, and all later live processes adopt this value as their decisions.

Protocol  1.  Every  process  executes  the  following: 

1. Test-and-set c. If the value of c is #, then set c to 0 (1) if the input is 0 (1) and keep the input as the
decision. If the value of c is not #, then set the decision to the value of c.

Theorem  3.1:  For  any  number  of  processes,  there  is  a  fully  resilient  agreement  protocol  that  uses  test- 
and-set  operations  on  one  three-valued  cell. 
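
Protocol 1 is small enough to check exhaustively for a few processes. The Python sketch below is an
illustration and not a substitute for a proof: the names `run_protocol_1` and `check_protocol_1` are
hypothetical, processes are indexed from 0, and a process that dies is modeled as one that never
takes its single step, which matches the rule that a dying process leaves the cell unchanged.

```python
from itertools import permutations, product

BLANK = "#"


def run_protocol_1(inputs, order):
    """Run Protocol 1 with processes taking their single step in the given order.

    Processes absent from `order` are treated as having died before stepping.
    Returns a dictionary from each process that stepped to its decision.
    """
    c = BLANK
    decisions = {}
    for j in order:
        if c == BLANK:       # test-and-set: the first process installs its input
            c = inputs[j]
        decisions[j] = c     # every process decides on the value now in c
    return decisions


def check_protocol_1(n):
    """Check agreement and validity over all inputs, step orders, and sets of survivors."""
    for inputs in product((0, 1), repeat=n):
        for k in range(1, n + 1):
            for order in permutations(range(n), k):
                decided = set(run_protocol_1(inputs, order).values())
                if len(decided) != 1 or not decided <= set(inputs):
                    return False
    return True
```

For instance, `check_protocol_1(3)` examines every input vector and every ordering of every nonempty
set of surviving processes.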

Second, consider a system S with two processes, PROC(S) = {1,2}, and two-valued cells, VAL(S) =
{0,1}. Assume that the initial value of every cell is 0. It is easy to design an agreement protocol with
three cells: each process uses one cell to announce its input, and then the processes test-and-set the third
cell to 1; the decision of both processes is the input of the process that changed this cell from 0 to 1. We
show that just two cells suffice. Let MEM(S) = {c,d}. Both processes execute the following protocol.

Protocol  2. 

1.  Test-and-set  c.  If  the  value  of  c  is  1,  then  leave  c  unchanged,  set  the  decision  to  1,  and  halt.  If  the 
value  of  c  is  0,  then  set  the  value  of  c  to  the  input,  and  continue. 

2. Test-and-set d. If the value of d is 0, then set d to 1, keep the input as the decision, and halt. If the
value of d is 1, then leave d unchanged, and continue.

3.  Read  c.  If  the  value  of  c  is  0,  then  set  the  decision  to  0.  If  the  value  of  c  is  1,  then  set  the  decision 
to  the  complement  of  the  input;  that  is,  if  the  input  is  0  (1),  then  the  decision  is  1  (0). 

Theorem  3.2:  For  two  processes  there  is  an  agreement  protocol  that  uses  test-and-set  operations  on  two 
two-valued  cells. 

Proof: We establish the correctness of Protocol 2. Call the processes George and Hannah, and assume
that George performs the test-and-set on c first.

Case 1: The input of George is 1. Since George sets c to 1, Hannah determines that the value of c
is 1, and Hannah halts with decision 1 without accessing d. Thus, George must find that d is 0. After
George sets d to 1, it halts with decision 1.

Case 2: The input of George is 0, and George sets d to 1 first. In this case George retains 0 as its
decision. After Hannah finds that d is 1, it reads c again. The value of c at this time is Hannah's input. If
c is 0, then Hannah halts with decision 0, agreeing with George. If c is 1, then Hannah sets its decision to
0, the complement of its input, again agreeing with George.

Case 3: The input of George is 0, and Hannah sets d to 1 first. In this case Hannah retains its input
as its decision. After George finds that d is 1, it reads c again. The value of c at this time is Hannah's
input. If c is 0, then George halts with decision 0, agreeing with Hannah. If c is 1, then George sets its
decision to 1, the complement of its input, again agreeing with Hannah. □
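
The case analysis above can also be checked mechanically. The following Python sketch is not part of the original development; it enumerates every interleaving of the atomic steps of Protocol 2 for two processes that both stay alive. The names step, explore, and check_protocol2 are hypothetical, and a program counter of None is an assumed encoding of "halted".

```python
from itertools import product


def step(pc, inp, mem):
    """Run one atomic step of Protocol 2 on `mem`.

    Returns (next_pc, decision); next_pc is None once the process halts.
    """
    if pc == 1:  # step 1: test-and-set c
        if mem["c"] == 1:
            return None, 1
        mem["c"] = inp
        return 2, None
    if pc == 2:  # step 2: test-and-set d
        if mem["d"] == 0:
            mem["d"] = 1
            return None, inp
        return 3, None
    # step 3: read c
    return None, (0 if mem["c"] == 0 else 1 - inp)


def explore(inputs, pcs=(1, 1), mem=None, decisions=(None, None)):
    """Yield the pair of decisions reached at the end of every interleaving."""
    mem = {"c": 0, "d": 0} if mem is None else mem
    runnable = [p for p in (0, 1) if pcs[p] is not None]
    if not runnable:
        yield decisions
        return
    for p in runnable:
        new_mem = dict(mem)  # copy so that sibling branches stay independent
        new_pc, dec = step(pcs[p], inputs[p], new_mem)
        new_pcs = tuple(new_pc if q == p else pcs[q] for q in (0, 1))
        new_decisions = tuple(
            dec if (q == p and dec is not None) else decisions[q] for q in (0, 1)
        )
        yield from explore(inputs, new_pcs, new_mem, new_decisions)


def check_protocol2():
    """Check agreement and validity over all inputs and all interleavings."""
    for inputs in product((0, 1), repeat=2):
        for d0, d1 in explore(inputs):
            assert d0 == d1        # agreement
            assert d0 in inputs    # the decision is some process's input
    return True


if __name__ == "__main__":
    print(check_protocol2())
```

The sketch covers only computations in which neither process dies; it complements the proof rather than replacing it.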

Section  2.4.3  shows  that  there  is  no  fully  resilient  test-and-set  protocol  with  two-valued  cells  if  there 
are  more  than  two  processes. 

Third, we present a 1-resilient test-and-set protocol for n processes with four two-valued cells.
Consider a system S with PROC(S) = {1, 2, ..., n}, MEM(S) = {c, d, e, f}, and VAL(S) = {0, 1}. Assume
that the initial value of every cell is 0. In the agreement protocol for this system process 1 and process 2
execute Protocol 2 using cells c and d. Then both write the decision into cell e and set cell f ("finished")
to 1 to signify the completion of the protocol. Processes 3, ..., n wait until either process 1 or process 2
sets f to 1.

Protocol 3. Processes 1 and 2 execute A1 through A5 below.

A1. Test-and-set c. If the value of c is 1, then leave c unchanged, set the decision to 1, and go to step
A4. If the value of c is 0, then set the value of c to the input, and continue.

A2. Test-and-set d. If the value of d is 0, then set d to 1, keep the input as the decision, and go to step
A4. If the value of d is 1, then leave d unchanged, and continue.

A3. Read c. If the value of c is 0, then set the decision to 0. If the value of c is 1, then set the decision
to the complement of the input.

A4. Write the decision into e.

A5. Write 1 into f.

Processes 3, ..., n execute B1 and B2 below.

B1. Repeatedly read f until its value is 1.

B2. Read e and set the decision to the value of e.

Theorem 3.3: For any number of processes, there is a 1-resilient agreement protocol that uses test-and-
set operations on four two-valued cells.

Proof: We establish the correctness of Protocol 3 for computations in which at most one process dies.
By the correctness of Protocol 2, both process 1 and process 2 choose the same decision v. Since at most
one process dies, either process 1 or process 2 sets cell e to v and cell f to 1. Once f is set to 1, the value
of e will not change. Therefore, processes 3, ..., n will eventually read e and set their decisions to v. □
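
A small simulation can make the waiting behavior of Protocol 3 concrete. The Python sketch below is an illustration under stated assumptions: each process is a generator whose yield marks the end of one atomic step, the scheduler picks a live process at random from a seeded generator, and a crash is modeled by never scheduling the chosen process again after crash_after steps. The names leader, follower, and run are hypothetical.

```python
import random
from itertools import product


def leader(pid, inp, mem, out):
    """Steps A1-A5 for processes 1 and 2."""
    if mem["c"] == 1:                      # A1: test-and-set c
        dec = 1
        yield
    else:
        mem["c"] = inp
        yield
        if mem["d"] == 0:                  # A2: test-and-set d
            mem["d"] = 1
            dec = inp
            yield
        else:
            yield
            dec = 0 if mem["c"] == 0 else 1 - inp   # A3: read c
            yield
    mem["e"] = dec                         # A4: write the decision into e
    yield
    mem["f"] = 1                           # A5: write 1 into f
    out[pid] = dec


def follower(pid, mem, out):
    """Steps B1-B2 for processes 3, ..., n."""
    while mem["f"] != 1:                   # B1: each unsuccessful read is one step
        yield
    yield                                  # the successful read of f ends a step
    out[pid] = mem["e"]                    # B2: read e


def run(inputs, crashed=None, crash_after=0, seed=0):
    """Simulate Protocol 3; `crashed` dies after taking `crash_after` steps."""
    rng = random.Random(seed)
    mem = {"c": 0, "d": 0, "e": 0, "f": 0}
    out = {}
    procs = {}
    for pid in range(1, len(inputs) + 1):
        if pid <= 2:
            procs[pid] = leader(pid, inputs[pid - 1], mem, out)
        else:
            procs[pid] = follower(pid, mem, out)
    steps = {pid: 0 for pid in procs}
    while procs:
        pid = rng.choice(sorted(procs))
        if pid == crashed and steps[pid] >= crash_after:
            del procs[pid]                 # the process dies undetectably
            continue
        try:
            next(procs[pid])
            steps[pid] += 1
        except StopIteration:
            del procs[pid]                 # the process finished its protocol
    return out


if __name__ == "__main__":
    print(run((0, 1, 1, 0), crashed=1, crash_after=2, seed=7))
```

Because at most one process is crashed in this sketch, one of processes 1 and 2 always completes A5, so the waiting processes eventually leave B1.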

Section  2.5.3  shows  that  there  is  no  2-resilient  test-and-set  protocol  with  two-valued  cells. 

2.4  Impossibility  of  Fully  Resilient  Agreement 
2.4.1  Properties  of  fully  resilient  protocols 

Let Π be a fully resilient agreement protocol on the process system with n processes. Let Γ be the
computation graph for Π.

Lemma 4.1: Γ is acyclic.

Proof: Suppose, on the contrary, Γ has a directed cycle

s_1, s_2, ..., s_1.

The arcs in this cycle define an infinite sequence of process transitions because the participating processes
do not have decisions. Because this computation does not terminate, this contradicts the hypothesis that
Π is a fully resilient agreement protocol. □
For a node s and a process j, consider the directed path in Γ that starts at s and follows only arcs
labeled j. Since Π is fully resilient, it achieves agreement even if all processes but j die; hence, this path
must end at a system state t in which process j has a decision. That is, state(j,t) ∈ {E_j^0, E_j^1}. Define
endpath(j,s) to be this system state t.

Lemma  4.2:  There  is  a  bivalent  initial  system  state. 

Proof: Let G and H be processes. Let s_0 be an initial system state in which the input of process G is 0
and the input of H is 1. Let s_G = endpath(G,s_0) and s_H = endpath(H,s_0). When G reaches its state in s_G
it must have a decision that should be valid even if all other processes died. Thus the decision of G in s_G
is 0. It follows that the decisions of all processes in all leaves that are successors of s_G must be 0.
Similarly the decisions of all processes in all leaves that are successors of s_H must be 1. Since both s_G
and s_H are reachable from s_0, the state s_0 is bivalent. □

Lemma  4.3:  Every  bivalent  node  has  n  children. 

Proof: Let s be a node with fewer than n children. There is some process j with no transition from s.
Then state(j,s) is either E_j^0 or E_j^1. In the first case the decision of every process in every leaf reachable
from s must be 0, in the second case 1. Consequently, s is univalent. □

Lemma  4.4:  There  is  a  bivalent  node  s  whose  children  are  all  univalent. 

Proof: Let W be the bivalent nodes of Γ. By Lemma 4.2, W is not empty. By Lemma 4.3, each node in
W has n children. If every node in W had a child in W, then since Γ is acyclic, Γ would be infinite.
Because Γ is finite, there is a node s in W such that all children of s are not in W. □

Lemma  4.4  asserts  the  existence  of  a  crucial  node  s  such  that  every  transition  from  s  compels  an 
irrevocable  decision.  We  show  that  each  of  these  transitions  must  change  the  value  of  a  cell. 
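
To make the notions of valency and of the node promised by Lemma 4.4 concrete, the following Python sketch builds the computation graph of Protocol 2 for the inputs (0, 1) and searches it for bivalent nodes whose children are all univalent. It is an illustration, not part of the proof. The encoding of a system state as (program counters, decisions, value of c, value of d), the use of 0 as the program counter of a halted process, and the names children, valence, and critical_nodes are all assumptions of the sketch.

```python
from functools import lru_cache

INPUTS = (0, 1)  # process 0 has input 0, process 1 has input 1


def children(state):
    """Return the successors of a system state, one per process that can move."""
    pcs, decs, c, d = state
    result = []
    for p in (0, 1):
        pc, inp = pcs[p], INPUTS[p]
        if pc == 0:                       # 0 marks a halted process
            continue
        new_c, new_d, new_pc, dec = c, d, 0, None
        if pc == 1:                       # step 1: test-and-set c
            if c == 1:
                dec = 1
            else:
                new_c, new_pc = inp, 2
        elif pc == 2:                     # step 2: test-and-set d
            if d == 0:
                new_d, dec = 1, inp
            else:
                new_pc = 3
        else:                             # step 3: read c
            dec = 0 if c == 0 else 1 - inp
        new_pcs = tuple(new_pc if q == p else pcs[q] for q in (0, 1))
        new_decs = tuple(
            dec if (q == p and dec is not None) else decs[q] for q in (0, 1)
        )
        result.append((new_pcs, new_decs, new_c, new_d))
    return result


@lru_cache(maxsize=None)
def valence(state):
    """Set of decision values that appear in leaves reachable from `state`."""
    kids = children(state)
    if not kids:                          # leaf: every process has decided
        return frozenset(state[1])
    return frozenset().union(*(valence(k) for k in kids))


def critical_nodes(root):
    """Bivalent nodes reachable from `root` whose children are all univalent."""
    seen, stack, found = set(), [root], []
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.add(s)
        kids = children(s)
        stack.extend(kids)
        if len(valence(s)) == 2 and all(len(valence(k)) == 1 for k in kids):
            found.append(s)
    return found


if __name__ == "__main__":
    root = ((1, 1), (None, None), 0, 0)
    print(valence(root), critical_nodes(root))
```

For these inputs the initial state is bivalent, and one such node is the state in which both processes have finished step 1 with c equal to 1 and d still 0: whichever process performs its test-and-set on d first fixes the outcome.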

Lemma 4.5: Let j be a process and s and t be system states such that state(j,s) = state(j,t) and
value(c,s) = value(c,t) for all memory cells c. Then the decisions of process j in endpath(j,s) and in
endpath(j,t) are the same.

Proof: Let s' = A(j,s) and t' = A(j,t). The hypotheses on s and t guarantee that state(j,s') = state(j,t')
and value(c,s') = value(c,t') for all memory cells c. By induction, the states of process j in endpath(j,s)
and in endpath(j,t) are the same. □

Lemma 4.6: Let s be a bivalent node whose children are all univalent. For every process j, the transition
by j from s changes the value of the cell that j accesses in s.

Proof: Let George (G) and Hannah (H) be any processes such that w = A(G,s) is a 0-valent child of s and
x = A(H,s) is a 1-valent child of s. Let c_G = cell(state(G,s)) and c_H = cell(state(H,s)). If, to the
contrary, the transition by George from s to w does not change the value of c_G, then for every cell d,
value(d,s) = value(d,w). Consequently, by Lemma 4.5, since state(H,w) = state(H,s), the decisions of
Hannah in endpath(H,w) and in endpath(H,s) would be the same. But this is impossible because w is
0-valent and x, which lies on the path to endpath(H,s), is 1-valent. Similarly, the transition by Hannah
from s to x changes the value of c_H. □

2.4.2  Read/write  protocols 

Theorem 4.1: There is no fully resilient read/write agreement protocol for n ≥ 2.

Proof: Suppose Π is a fully resilient read/write agreement protocol. In the computation graph for Π,
Lemma 4.4 guarantees the existence of a bivalent node s whose children are all univalent. Choose a 0-
valent child w of s and a 1-valent child x of s. Let George (G) and Hannah (H) be the processes such that
w = A(G,s) and x = A(H,s). Let y = A(H,w) and z = A(G,x). Since w is 0-valent, y is 0-valent. Since x is
1-valent, z is 1-valent. See Figure 2.1.

Let q_G = state(G,s) and q_H = state(H,s). Lemma 4.6 guarantees that both q_G and q_H are WRITE
states. Let c_G = cell(q_G) and c_H = cell(q_H). We examine two cases, both of which lead to
contradictions.

Case 1: c_G ≠ c_H. Since the transitions by George and by Hannah affect different memory cells,
system states y and z are the same. But y = z cannot be both 0-valent and 1-valent.

Case 2: c_G = c_H. Since the transition by George from x to z obliterates the value written by Hannah
in the transition from s to x, value(c_G,w) = value(c_G,z). Indeed, value(d,w) = value(d,z) for all cells d.
Furthermore, George's state is the same in both w and z. By Lemma 4.5, George's decision in
endpath(G,w) is the same as in endpath(G,z). But this is impossible because w is 0-valent and z is 1-
valent. □

Theorem 4.1 is similar in spirit to Theorem 1 of Johnson and Schneider [32].

2.4.3 Test-and-set protocols

Theorem 4.2: There is no fully resilient test-and-set agreement protocol if n ≥ 3 and memory cells have
only two values.

Proof: Suppose Π is a fully resilient test-and-set agreement protocol for a system with n ≥ 3 processes
and two-valued memory cells. In the computation graph for Π, Lemma 4.4 guarantees the existence of a
bivalent node s whose children are all univalent. Choose a 0-valent child w of s and a 1-valent child x of
s. Let George (G) and Hannah (H) be the processes such that w = A(G,s) and x = A(H,s). Let y =
A(H,w) and z = A(G,x). Since w is 0-valent, y is 0-valent; since x is 1-valent, z is 1-valent. See Figure
2.1. Let c_G = cell(state(G,s)) and c_H = cell(state(H,s)). We examine two cases, both of which lead to
contradictions.

Case 1: c_G ≠ c_H. Since George and Hannah access different cells, system states y and z are the
same. But y = z cannot be both 0-valent and 1-valent.

Case 2: c_G = c_H. Let c = c_G = c_H. Because c can attain only two different values and because, by
Lemma 4.6, both the transition by George to w and the transition by Hannah to x change the value of c, it
follows that value(c,w) = value(c,x). Since n ≥ 3, there is a process Frances (F) different from both
George and Hannah. By Lemma 4.5, since state(F,w) = state(F,x) = state(F,s), the decisions of Frances
in endpath(F,w) and in endpath(F,x) are the same. This is a contradiction because w is 0-valent and x is
1-valent. □

2.5 Impossibility of k-Resilient Agreement

We establish that there is no 1-resilient read/write agreement protocol and no 2-resilient test-and-set
agreement protocol. These results imply Theorems 4.1 and 4.2, but their proofs are more subtle. The
proofs resemble the arguments of Dolev, Dwork, and Stockmeyer [14], who considered only message-
passing systems.

2.5.1 Properties of k-resilient protocols

Let S be a system of n processes; we allow processes to have an infinite number of states. Let Π be
a k-resilient agreement protocol for S with k ≥ 1. Let Γ be the computation graph for Π. If k < n-1, then
since Π may allow processes to wait, Γ may have directed cycles. For example, in Protocol 3 in Section
2.3, process 3 repeatedly reads f until process 1 or process 2 writes 1 into f. Furthermore, if the number
of process states is infinite, then Γ would be infinite. The proofs in Section 2.4 require that Γ be acyclic
and finite.

In Γ, let R be a directed path, possibly with repeated nodes, from node s to node t. Let nodes(R) be
the sequence of nodes visited by R, including s and t. For a set of processes P call R P-free if R has no
transitions by processes in P. If P has at most k processes, then since Π is k-resilient, from every node
there is a P-free path to a node in which all processes not in P have reached the same decision.

Lemma 5.1: Let s and t be univalent nodes such that value(c,s) = value(c,t) for all memory cells
c. Let P = {p : state(p,s) ≠ state(p,t)}. If |P| ≤ k, then s and t are either both 0-valent or both 1-valent.

Proof: Suppose s is 0-valent; the case of a 1-valent s is similar. Since Π is k-resilient, there is a P-
free path R from s to a node in which all processes not in P have reached 0 decisions. Let j_1, ..., j_r be
the labels of arcs of R, in sequence. Since R is P-free, none of j_1, ..., j_r is in P. Thus from the
viewpoint of processes j_1, ..., j_r, system states s and t look the same. Define s_0 = s and t_0 = t, and for
i = 1, ..., r, s_i = A(j_i, s_{i-1}) and t_i = A(j_i, t_{i-1}). A straightforward inductive argument yields for
i = 0, ..., r, state(j,s_i) = state(j,t_i) for all j ∉ P, and value(c,s_i) = value(c,t_i) for all memory cells c.
By construction of R, in s_r all processes not in P have reached 0 decisions. Consequently, since
state(j,s_r) = state(j,t_r) for all j ∉ P, in t_r all processes not in P have reached 0 decisions. Therefore,
since t is univalent, t must be 0-valent. □

Lemma  5.2:  There  is  a  bivalent  initial  system  state. 

Proof: Suppose, to the contrary, every initial system state is univalent. Since Π is an agreement
protocol, the initial system state in which all processes have input 0 is 0-valent. Similarly, the initial
system state in which all processes have input 1 is 1-valent. By changing each input from 0 to 1 one at a
time, we obtain initial system states s_0 and s_1 such that s_0 is 0-valent, s_1 is 1-valent, and there is exactly
one process h such that state(h,s_0) ≠ state(h,s_1). Since Π is k-resilient, it is at least 1-resilient, and we
have obtained a contradiction of Lemma 5.1. □
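
The step "changing each input from 0 to 1 one at a time" is a simple chain argument, which the following Python sketch spells out. It assumes, as the proof does for the sake of contradiction, that every initial state is univalent, so a function can assign a single value to each input vector. The names find_flip and majority_valence are hypothetical, and the majority rule is only a toy stand-in for the valency of a real protocol.

```python
def find_flip(valence, n):
    """Walk from the all-0 input vector to the all-1 vector, one flip at a time.

    Returns (a, b, h) where a and b differ only in the input of process h and
    valence(a) != valence(b), or None if no such adjacent pair exists.
    """
    current = [0] * n
    for h in range(n):
        flipped = current.copy()
        flipped[h] = 1
        if valence(tuple(current)) != valence(tuple(flipped)):
            return tuple(current), tuple(flipped), h
        current = flipped
    return None


def majority_valence(inputs):
    """Toy assumption: the outcome is the majority input, with ties going to 0."""
    return 1 if 2 * sum(inputs) > len(inputs) else 0


if __name__ == "__main__":
    print(find_flip(majority_valence, 5))
```

Since the all-0 vector has value 0 and the all-1 vector has value 1, some adjacent pair on the chain must differ in value; the two states of that pair differ only in the state of one process, which is what contradicts Lemma 5.1.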




2.5.2 1-Resilient read/write protocols

Throughout this section assume that Π is a 1-resilient read/write agreement protocol and that Γ is
the computation graph for Π. We shall construct a path in Γ in which every process takes infinitely many
transitions, but no process reaches a decision.

Lemma 5.3: For every bivalent node s and every process j there is a node t reachable from s such
that A(j,t) is bivalent.

Proof: Let George (G) be a process. Suppose, to the contrary, for every t reachable from s, A(G,t)
is univalent. In particular, A(G,s) is univalent. Without loss of generality, assume A(G,s) is 0-valent.

Since s is bivalent, there is a path from s to a 1-leaf. Let s' be the node on this path such that
state(G, A(G,s')) = E_G^1. Node A(G,s') is 1-valent.

Let R be a path from s to s'. By assumption, A(G,t) is univalent for all t in nodes(R). Since A(G,s)
is 0-valent and A(G,s') is 1-valent, it is straightforward to prove by induction on the length of R that there
exist consecutive nodes u and x in nodes(R) such that A(G,u) is 0-valent and A(G,x) is 1-valent. Let
Hannah (H) be the process such that x = A(H,u). Let w = A(G,u), y = A(H,w), and z = A(G,x). Since w is
0-valent, y is 0-valent. See Figure 2.2.

Let q_G = state(G,u) = state(G,x) and q_H = state(H,u). First, we claim that q_G and q_H must be
WRITE states. Suppose, to the contrary, q_G is a READ state. Since the values of all cells are the same in
u and in w, Hannah undergoes the same transition from both u and w. Consequently, the values of all
cells are the same in y and in x; hence, the values of all cells are the same in y and in z. The states of all
processes other than George are the same in y and in z. This is a contradiction of Lemma 5.1 because y is
0-valent and z is 1-valent. Thus q_G must be a WRITE state. Similarly, q_H must be a WRITE state.

Let c_G = cell(q_G) and c_H = cell(q_H). We examine two cases, both of which lead to contradictions.

Case 1: c_G ≠ c_H. Since the transitions by George and by Hannah affect different memory cells,
system states y and z are the same. But y = z cannot be both 0-valent and 1-valent.
Figure 2.2. The Transitions by G and H from u


Case 2: c_G = c_H. Since the transition by George from x to z obliterates the value written by Hannah
in the transition from u to x, all cells have the same values in w and in z. Furthermore, all processes other
than Hannah are in the same state in w and in z. This contradicts Lemma 5.1 because w is 0-valent and z
is 1-valent. □

Theorem 5.1: There is no 1-resilient read/write agreement protocol.

Proof: Suppose there is a 1-resilient read/write agreement protocol Π for a system with n processes
numbered 0, ..., n-1. Let s_0 be a bivalent initial system state guaranteed by Lemma 5.2. Define infinite
sequences of paths R_0, R_1, ... and nodes s_1, s_2, ... as follows: for i = 0, 1, ..., use Lemma 5.3 to find a
path R_i starting from s_i to a node t_i such that A((i mod n), t_i) is bivalent, and define

s_{i+1} = A((i mod n), t_i).

The concatenation of nodes(R_0), nodes(R_1), ... is a computation of Π in which every process makes
infinitely many transitions, but no process reaches a decision. Thus Π may not terminate.
Contradiction. □

2.5.3 2-Resilient test-and-set protocols

Consider a system with n ≥ 3 processes in which cells have only two values. Suppose Π is a 2-
resilient test-and-set agreement protocol. First we establish two lemmas about the computation graph of
Π.

Lemma 5.4: No bivalent node has both a 0-valent child and a 1-valent child.

Proof: Suppose, to the contrary, bivalent node s has a 0-valent child w and a 1-valent child x. Let
George (G) and Hannah (H) be the processes such that w = A(G,s) and x = A(H,s). Let y = A(H,w) and z
= A(G,x). Since w is 0-valent, y is 0-valent; since x is 1-valent, z is 1-valent. See Figure 2.1.

Let c_G = cell(state(G,s)) and c_H = cell(state(H,s)). First, we claim that c_G = c_H. If c_G ≠ c_H, then
the transitions by George and by Hannah affect different cells; hence, y and z are the same. But y = z
cannot be both 0-valent and 1-valent.

Let c = c_G = c_H. There are three cases, all of which lead to contradictions.

Case 1: value(c,w) = value(c,s). The transition by Hannah from s to x is the same as from w to y.
All cells have the same values in x and in y. All processes other than George are in the same states in x
and in y. Apply Lemma 5.1 to x and y to obtain a contradiction.

Case 2: value(c,x) = value(c,s). Similar to Case 1.

Case 3: value(c,w) ≠ value(c,s) and value(c,x) ≠ value(c,s). Since c can attain only two values,
value(c,w) = value(c,x). Thus all cells have the same values in w and in x. All processes other than
George and Hannah are in the same states in w and in x. Since Π is 2-resilient and there are at least 3
processes, apply Lemma 5.1 to w and x to obtain a contradiction. □

Lemma 5.5: For every bivalent node s and every two distinct processes i and j, there is a path from
s to a node t such that either A(i,t) or A(j,t) is bivalent.

Proof: Let George (G) and Hannah (H) be two distinct processes. Suppose, to the contrary, for
every t reachable from s, A(G,t) and A(H,t) are univalent. Without loss of generality assume A(G,s) is
0-valent. By Lemma 5.4, A(H,s) is 0-valent.

Since s is bivalent, there is a path from s to a 1-leaf. Without loss of generality, assume that George
reaches a 1 decision before Hannah does on this path. Let s' be the node on this path such that
state(G, A(G,s')) = E_G^1. Node A(G,s') is 1-valent. By Lemma 5.4, A(H,s') is either bivalent or 1-valent.
By assumption, since s' is reachable from s, A(H,s') is 1-valent.

Let R be a path from s to s'. By assumption, A(G,t) and A(H,t) are both univalent for all t in
nodes(R). Therefore, there exist consecutive nodes t_1 and t_2 in nodes(R) such that both A(G,t_1) and
A(H,t_1) are 0-valent, and both A(G,t_2) and A(H,t_2) are 1-valent. Let Frances (F) be the process such that
t_2 = A(F,t_1).

Let t_3 = A(G,t_1), t_4 = A(G,t_2), t_5 = A(H,t_2), t_6 = A(F,t_3), t_7 = A(H,t_3), and t_8 = A(H,t_6). Nodes
t_6, t_7, and t_8 are 0-valent because t_3 is 0-valent. See Figure 2.3. Let c_F = cell(state(F,t_1)), c_G =
cell(state(G,t_1)), and c_H = cell(state(H,t_1)).

First we claim that c_F = c_G. If c_F ≠ c_G, then system states t_4 and t_6 would be the same. But t_4 is
1-valent and t_6 is 0-valent.

Let c = c_F = c_G. There are three cases, all of which lead to contradictions.

Case 1: value(c,t_2) = value(c,t_1). Then George undergoes the same transition from t_1 to t_3 as
from t_2 to t_4. All cells have the same values in t_3 and in t_4. All processes other than Frances are in the
same states in t_3 and in t_4. Apply Lemma 5.1 to t_3 and t_4 to obtain a contradiction.

Case 2: value(c,t_3) = value(c,t_1). The transitions of Frances and Hannah are the same from t_1 to
t_5 as from t_3 to t_8. All cells have the same values in t_5 and in t_8. All processes other than George are in
the same states in t_5 and in t_8. Apply Lemma 5.1 to t_5 and t_8 to obtain a contradiction.

Case 3: value(c,t_2) ≠ value(c,t_1) and value(c,t_3) ≠ value(c,t_1). Since c can have only two
different values, value(c,t_2) = value(c,t_3). Hannah undergoes the same transition from t_2 to t_5 as from
t_3 to t_7. All cells have the same values in t_5 and in t_7. All processes other than George and Frances are
in the same states in t_5 and in t_7. Apply Lemma 5.1 to t_5 and t_7 to obtain a contradiction. □

Figure 2.3. The Transitions by F, G, and H

Theorem 5.2: There is no 2-resilient test-and-set agreement protocol if n ≥ 3 and memory cells have
only two values.

Proof: Suppose Π is a 2-resilient test-and-set agreement protocol. To obtain a contradiction, we
construct a computation of Π in which at least n-1 processes make infinitely many transitions, but no
process reaches a decision. Let s_0 be a bivalent initial system state guaranteed by Lemma 5.2. Let P_0 =
PROC(S). Define infinite sequences of paths R_0, R_1, ... and nodes s_1, s_2, ... as follows:
for i = 0, 1, ... do
    if |P_i| = 1 then P_i := PROC(S) end if;
    Choose any distinct G_i and H_i in P_i;
    Use Lemma 5.5 to find a path R_i starting from s_i to a node t_i such that
        either A(G_i,t_i) or A(H_i,t_i) is bivalent;
    if A(G_i,t_i) is bivalent
        then s_{i+1} := A(G_i,t_i); P_{i+1} := P_i - {G_i}
    else if A(H_i,t_i) is bivalent
        then s_{i+1} := A(H_i,t_i); P_{i+1} := P_i - {H_i}
    end if
end for

By construction, s_1, ..., s_{n-1} are defined by transitions of n-1 different processes; s_n, ..., s_{2n-2} are
defined by transitions of n-1 different processes; and so on. Ergo, the concatenation of nodes(R_0),
nodes(R_1), ... is a nonterminating computation of Π in which at most one process makes only finitely
many transitions. □

2.6  Summary 

Table 2.1 summarizes our results on the existence of agreement protocols for asynchronous shared
memory systems in which processes die undetectably.

Table 2.1. Summary of the Results of Chapter 2

Type of Protocol                     1-resilient          k-resilient for k ≥ 2
Read/write                           No (Theorem 5.1)     No (Theorem 5.1)
Test-and-set, three-valued cells     Yes (Theorem 3.1)    Yes (Theorem 3.1)
Test-and-set, two-valued cells       Yes (Theorem 3.3)    No (Theorem 5.2)

Theorem 5.1 implies the impossibility of 1-resilient Byzantine agreement in asynchronous
message-passing systems because a shared memory can simulate message-passing. More precisely, if
there were a 1-resilient agreement protocol for an asynchronous message-passing system with
undetectable process death, then we could transform it into a 1-resilient read/write protocol on a shared
memory. For every pair of processes (i,j) with i ≠ j, allocate a group of cells to hold all the messages that
process i could send to process j. To simulate the sending of a message, process i writes into the next
unused cell in this group.
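
The simulation of message passing by shared memory can be sketched in a few lines of Python. This is an illustration only: the class name CellGroups, its methods, and the finite capacity of each group are assumptions of the sketch (the argument above allows a group large enough for every message that could be sent), and the cursors next_write and next_read stand for local state of the sender and of the receiver.

```python
class CellGroups:
    """One group of write-once cells for every ordered pair (i, j), i != j."""

    def __init__(self, n, capacity):
        pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
        self.capacity = capacity
        self.cells = {pair: [None] * capacity for pair in pairs}
        self.next_write = {pair: 0 for pair in pairs}   # sender's local cursor
        self.next_read = {pair: 0 for pair in pairs}    # receiver's local cursor

    def send(self, i, j, message):
        """Process i writes the message into the next unused cell of group (i, j)."""
        k = self.next_write[(i, j)]
        if k >= self.capacity:
            raise RuntimeError("group (i, j) is full in this bounded sketch")
        self.cells[(i, j)][k] = message
        self.next_write[(i, j)] = k + 1

    def receive(self, i, j):
        """Process j reads the next cell of group (i, j); None means nothing yet."""
        k = self.next_read[(i, j)]
        if k >= self.capacity:
            return None
        message = self.cells[(i, j)][k]
        if message is not None:
            self.next_read[(i, j)] = k + 1
        return message


if __name__ == "__main__":
    groups = CellGroups(n=3, capacity=4)
    groups.send(0, 1, "vote 1")
    print(groups.receive(0, 1), groups.receive(0, 1))
```

Each send is a single write and each receive is a single read, so a message-passing protocol translated this way uses only read and write operations on the shared memory.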

Recently Burns (personal communication) generalized our results. He determined relationships
between the number of possible inputs and the number of memory cell values required to guarantee
agreement. Furthermore, his protocols use the weak test-and-set operation.
CHAPTER 3

FAULT-TOLERANT  DISTRIBUTED  ALGORITHM 
FOR  ELECTION  IN  COMPLETE  NETWORKS 

3.1  Introduction 


We  consider  the  election  problem  on  asynchronous  complete  networks  when  the  processors  are 
reliable  but  some  of  the  links  may  be  faulty.  The  failure  type  we  consider  is  more  malicious  than  fail-stop 
failure  but  less  malicious  than  Byzantine  failure.  The  faulty  links  fail  by  losing  messages  at  will.  Thus  a 
faulty  link  may  act  as  an  adversary  who  deletes  a  message  at  the  most  inopportune  time.  Since  the 
network is asynchronous, the link delays are arbitrary; hence, the processors cannot distinguish between
slow  links  and  faulty  links.  In  other  words,  the  faulty  links  are  undetectable.  We  call  the  type  of  failure 
that  we  consider  intermittent.  Bar-Yehuda  et  al.  [7]  solve  the  same  problem  for  fail-stop  link  failure. 
They  also  assume  that  if  a  link  fails,  then  it  fails  before  the  execution  of  the  algorithm.  Their  algorithm 
does  not  tolerate  intermittent  link  failure,  and  there  is  no  easy  way  to  generalize  their  algorithm  to  handle 
intermittent  failure.  Our  work  is  independent  of  their  work. 


Let n be the number of processors in the network. Let f be the maximum number of faulty links,
where 1 ≤ f ≤ ⌊n/2 - 3⌋, and let r be a design parameter [bounds on r illegible in source]. We develop an
asynchronous algorithm whose running time and message count are functions of n, r, and f [expressions
illegible in source], and that uses at most O(log |T|) bits per message, where |T| is the cardinality of the
set of node identifiers. The value of r that minimizes the total number of messages in our algorithm is
expressed in terms of a quantity C [expression illegible in source]. For every value of n and f subject to
f ≤ ⌊n/2 - 3⌋, C is at most 1. Thus the minimum number of messages that our algorithm uses is

O(nf + n log n). Since the complete network algorithm of Bar-Yehuda et al. [7] also uses Θ(nf + n log n)
messages, our algorithm subsumes theirs.
3.2 Model

Our  model  follows  Goldreich  and  Shrira’s  model  [25].  Consider  an  asynchronous  complete 
network  of  n  processors.  We  model  the  network  as  a  complete  graph  on  n  nodes,  in  which  each  node 
represents  a  processor,  and  each  edge  represents  a  bidirectional  communication  link.  Henceforth,  we  will 
not  distinguish  between  a  node  and  the  processor  it  represents,  and  we  will  not  distinguish  between  an 
edge  and  the  link  it  represents.  Each  node  u  has  a  unique  identifier,  ID(u),  chosen  from  a  totally  ordered 
set.  No  node  initially  knows  the  identifier  of  any  other  node,  but  the  nodes  know  that  the  network  is 
complete.  When  a  node  u  wishes  to  communicate  with  a  node  v,  then  u  sends  a  message  to  v  on  the  link 
l(u,v)  joining  them. 

A  distributed  algorithm  on  the  network  is  a  set  of  n  deterministic  local  programs,  each  assigned  to  a 
node.  Each  local  program  consists  of  computation  statements  and  communication  statements.  The 
computation  statements  control  the  internal  computations  of  a  node.  The  communication  statements  are 
of the form “send message M on link l” or “receive message M on link l.” Each node u has a
Send-Buffer(u,l) and a Receive-Buffer(u,l) associated with each link l incident on u, where the buffers are not
necessarily first-in first-out. Let l be l(u,v). When u wishes to send a message M on l, u places M in
Send-Buffer(u,l). We call this event a send event. To capture the asynchronous nature of our network,
messages may remain in the send-buffers for arbitrary lengths of time. A transmission event in l occurs
when l places M in Receive-Buffer(v,l). We assume that u cannot inspect Send-Buffer(u,l) to check
whether M was removed from the buffer. Hence M is in transit from u to v if M is in Send-Buffer(u,l). If
u wishes to process a message M' on l, then u removes M' from Receive-Buffer(u,l). We call this event a
receive event. For convenience, we assume that it takes one time unit to remove M' from
Receive-Buffer(u,l) and to execute the computation statements on M'. If M' is not in Receive-Buffer(u,l), then u
either waits for M', or u receives some other message, depending on u's local program. Note that when
we say that node u receives a message, we mean that u removes the message from a Receive-Buffer and
processes the message. A loss event in a link l is the event of l discarding a message.

Consider a particular execution E of a distributed algorithm. Let Events(E) be the multiset of the
events in E. For convenience, we assume the existence of a global clock that gives the time at which each
event in E occurs. Although this clock is available to an observer of the network, the nodes do not know
of its existence. We will assume that each event in E occurs at some discrete unit of time starting from
zero. Let Events(u) be the multiset of u's send and receive events in E. The local program in u induces a
total ordering on Events(u). Unlike in the model of Goldreich and Shrira [25], two events, each in a distinct
node, may occur at the same time. We say that a link l is faulty in E if l experiences at least one loss event
in E. In this chapter, we assume that at most f links fail during any execution. Note that the faulty links may be
different in different executions of the algorithm. If a link is not faulty, then it is reliable. We require
delays on reliable links to be finite. In other words, messages sent on reliable links must eventually be
delivered. Because of the asynchronous nature of the network, a node cannot distinguish between a slow
link and a faulty link that loses messages. Therefore, the nodes cannot detect the faulty links.
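
The buffer model can be made concrete with a small simulation. The following Python sketch is illustrative only: the class name `Link`, its methods, and the `drop` flag are assumptions introduced here and are not part of the model in [25]. It shows the send, transmission, receive, and loss events for one direction of a link l(u,v), with buffers that are not first-in first-out.

```python
import random


class Link:
    """One direction of l(u,v): Send-Buffer(u,l) feeding Receive-Buffer(v,l)."""

    def __init__(self, seed=0):
        self.send_buffer = []      # messages placed by u, still in transit
        self.receive_buffer = []   # messages delivered to v, not yet processed
        self.loss_events = 0       # a link with at least one loss event is faulty
        self._rng = random.Random(seed)

    def send(self, message):
        """Send event: u places the message in the send buffer."""
        self.send_buffer.append(message)

    def _take(self, buffer):
        # Buffers are not necessarily FIFO, so remove an arbitrary element.
        return buffer.pop(self._rng.randrange(len(buffer)))

    def transmit(self, drop=False):
        """Transmission event, or a loss event when drop is True."""
        if not self.send_buffer:
            return
        message = self._take(self.send_buffer)
        if drop:
            self.loss_events += 1
        else:
            self.receive_buffer.append(message)

    def receive(self):
        """Receive event: v removes and returns a message, or None if empty."""
        if not self.receive_buffer:
            return None
        return self._take(self.receive_buffer)

    @property
    def faulty(self):
        return self.loss_events > 0
```

In this sketch the sender has no way to observe `transmit`, which mirrors the assumption that u cannot inspect its send buffer: from the sender's side, a dropped message and a message that is still waiting look the same.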

In  this  chapter,  we  will  construct  a  distributed  election  algorithm  such  that,  at  the  end  of  every 
execution  of  the  algorithm,  a  unique  node  is  elected  as  a  leader  of  the  network.  Also,  all  the  nodes  will 
know  the  identifier  of  the  leader.  We  require  that  all  the  n  local  programs  that  form  the  election  algorithm 
be  identical. 

One  of  the  efficiency  measures  that  we  will  use  is  the  maximum  running  time  of  an  asynchronous 
algorithm.  Although  we  assume  that  the  link  delays  are  arbitrary  when  we  design  the  algorithm,  we  set 
the  link  delays  to  be  at  most  one  time  unit  when  we  compute  the  running  time  of  the  algorithm.  For 
convenience,  we  also  assume  that  all  receive  events  take  zero  time  units  when  we  compute  the  running 
time.  This  assumption  is  reasonable  in  real  systems  where  the  nodes’  processing  time  is  negligible 
compared  with  link  delays. 

3.3  Informal Description of the Algorithm

We present the formal description of the algorithm in Section 3.7.

3.3.1  Definitions

Let n be the number of nodes, f be the maximum number of faulty links, and r be a design
parameter. Consider a particular execution E of the algorithm. Each node tries to eliminate all the other
nodes from the competition to be the leader. The unique node that survives all eliminations elects itself
as the leader. If a node u is eliminated at some time t, then we say that u is dead at t. If u is dead at t,
then u is dead at all times $t' > t$, and u does not try to eliminate any other node after t. If u is not dead at t,
then u is live at t. All nodes are live at the beginning of the algorithm. We assume that all the nodes start
executing the algorithm simultaneously. (For the case when only a subset of the nodes starts executing
the algorithm or when the nodes start executing the algorithm at different times, we can easily modify the
algorithm by requiring sleeping nodes to be eliminated as soon as they receive any message.)


Each node u keeps an integer variable called phase(u). The value of phase(u) ranges from 0 to
$\lceil (n+2)/(2(r-1)f) \rceil + 1$. Intuitively, if u is live at some time t, then the value of phase(u) at t is a lower bound on
the number of nodes that u eliminated by time t. A node, whether live or dead, cannot decrease its phase.


A  live  node  may  increment  its  phase  to  reflect  an  increase  in  the  number  of  nodes  it  has  eliminated. 


If u eliminates a node v, then we say that u is a suppressor of v and that v is a victim of u. Node v
may have several suppressors during the execution E. It is possible for a node to be its own suppressor.
For every time t, a live node at t does not have any suppressor at or before t.


The algorithm maintains the following invariant, called the Algorithm Invariant: Node u becomes
a suppressor of a node v at time t only if all previous suppressors of v are dead at t. Thus each node v
keeps a link pointer called Suppressor-Link(v) that points to the most recent suppressor of v. Initially,
Suppressor-Link(v) = nil. A dead node v may increment phase(v) to give a lower bound on the phase of
the most recent suppressor of v.


When a node u decides that u is the leader, u informs all the nodes. Each node v halts its
execution of the algorithm when v learns of the existence of a leader.

3.3.2  The algorithm


Each node u keeps a set called Untrav(u) initially containing the names of the links incident on u.
Before executing the algorithm, the phase of each node u is 0. As soon as u starts executing the
algorithm, u sets phase(u) to 1. Each node u repeats the following until phase(u) becomes
$\lceil (n+2)/(2(r-1)f) \rceil + 1$:


Suppose that, for some node u, phase(u) becomes i at some time t and u is live at t, where
$1 \le i \le \lceil (n+2)/(2(r-1)f) \rceil$. If i = 1, then u chooses $\lceil rf \rceil$ link names from Untrav(u) and deletes them from
Untrav(u). If i > 1, then u chooses $\lceil rf \rceil - f$ link names from Untrav(u) and deletes them from Untrav(u).
Call the chosen links Chosen-Links(u,i). Node u sends the message “Eliminate-(i,ID(u))” on each link
in Chosen-Links(u,i). Call the multiset of these eliminate-messages New(u,i). (For simplicity, we will
use rf instead of $\lceil rf \rceil$, and (r-1)f instead of $\lceil rf \rceil - f$.)
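
The two quantities that drive each iteration, the final phase value and the number of links chosen per phase, are easy to tabulate. The sketch below is a minimal illustration; the function names `final_phase` and `links_chosen` are hypothetical, and r is converted to an exact `Fraction` so that the ceilings are not affected by floating-point rounding (pass r as an integer, a `Fraction`, or a string such as "3/2").

```python
import math
from fractions import Fraction


def final_phase(n, r, f):
    """Phase at which a live node elects itself: ceil((n+2)/(2(r-1)f)) + 1."""
    r = Fraction(r)
    return math.ceil(Fraction(n + 2) / (2 * (r - 1) * f)) + 1


def links_chosen(i, r, f):
    """Number of link names removed from Untrav(u) when phase(u) becomes i."""
    rf = math.ceil(Fraction(r) * f)
    # Phase 1 uses ceil(rf) links; every later phase uses ceil(rf) - f links.
    return rf if i == 1 else rf - f
```

For instance, with n = 20, r = 2, and f = 3, a node that survives every phase chooses 6 links in phase 1 and 3 links in each of phases 2 through 4 before reaching the final phase 5.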

For each node $v \ne u$, if v receives the message $M_1$ = “Eliminate-(i,ID(u))” at some time $t_v$, then v
compares (i,ID(u)) with (j,ID(v)) lexicographically, where j is the value of phase(v) at $t_v$:

Case 1: (i,ID(u)) < (j,ID(v)): Node v sends the message $M_2$ = “Elimination-Unsuccessful-(i,ID(u))” on
l(v,u). If u receives $M_2$ at some time $t_u$, then there are two cases:

Case 1.1: u has a suppressor at $t_u$: Then u discards $M_2$.

Case 1.2: u has no suppressor at $t_u$: Since u may have incremented phase(u) after u sent $M_1$, u
compares i with k, where k is the value of phase(u) at $t_u$. If k = i, then u becomes eliminated at $t_u$, and u
sets Suppressor-Link(u) to l(u,u) (that is, to itself) to indicate that u becomes its own suppressor. If i < k,
then $M_2$ is an out-of-date message. Thus, u sends the message “Eliminate-(k,ID(u))” on l(u,v). We say
that “Eliminate-(k,ID(u))” is a refresher for $M_1$.

Case 2: (j,ID(v)) < (i,ID(u)): Node v sets (phase(v),ID(v)) to (i,ID(u)). In addition, v does one of the
following:

Case 2.1: v has no suppressor at $t_v$: Then v becomes eliminated at $t_v$, and u becomes v's suppressor at
$t_v$. Trivially, the Algorithm Invariant is preserved. Node v sets Suppressor-Link(v) to l(v,u), and v sends
the message “Elimination-Successful” on l(v,u).

Case 2.2: v has a suppressor at $t_v$: Then Suppressor-Link(v) $\ne$ nil. Suppose that Suppressor-Link(v) =
l(v,w), where w may be v. Node v sends the message $M_3$ = “Potential-Suppressor-(i,ID(u))” on l(v,w).

If w receives $M_3$ at some time $t_w$, then w compares (i,ID(u)) with (q,ID(w)), where q is the value of
phase(w) at $t_w$:

Case 2.2.1: (i,ID(u)) < (q,ID(w)): Then w sends “Potential-Suppressor-Unsuccessful-(i,ID(u))” on
l(w,v). If v receives this message, then v sends “Elimination-Unsuccessful-(i,ID(u))” on l(v,u).

Case 2.2.2: (q,ID(w)) $\le$ (i,ID(u)): Then w sends the message $M_4$ = “Potential-Suppressor-Successful-(i,ID(u))”
on l(w,v). Also, w sets Suppressor-Link(w) to l(w,w) at time $t_w$ if w has no suppressor at $t_w$.
Otherwise, if w has a suppressor at $t_w$, then w leaves Suppressor-Link(w) unchanged. Node v may have
received an eliminate-message from some node $x \ne u$ after v sent $M_3$. Hence if v receives $M_4$, then v
compares ID(u) with ID(v). If ID(u) = ID(v), then u becomes a suppressor of v, and v sets
Suppressor-Link(v) to l(v,u). Note that the Algorithm Invariant is maintained. Also, v sends
“Elimination-Successful” on l(v,u). If ID(u) $\ne$ ID(v), then v sends “Elimination-Unsuccessful-(i,ID(u))” on l(v,u).


If u receives (r-1)f “Elimination-Successful” messages, each on a distinct link, and if u is live, then u
increments phase(u) by 1.
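
The first step of the case analysis, what a node v does on receiving an eliminate-message, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the formal algorithm of Section 3.7: the `Node` class, the function `on_eliminate`, and the use of an integer `sender` to name the link on which the message arrived are all hypothetical. Python compares tuples lexicographically, which matches the comparison of (phase, ID)-pairs. Ties, which the informal description does not discuss, fall into Case 2 in this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    ident: int                              # current ID(v); overwritten in Case 2
    phase: int = 1
    suppressor_link: Optional[int] = None   # nil is modeled as None


def on_eliminate(v: Node, i: int, uid: int, sender: int):
    """Return (message, destination) that v emits on receiving Eliminate-(i, uid)."""
    if (i, uid) < (v.phase, v.ident):       # Case 1: the sender's pair is smaller
        return ("Elimination-Unsuccessful", i, uid), sender
    # Case 2: v adopts the sender's (phase, ID)-pair.
    v.phase, v.ident = i, uid
    if v.suppressor_link is None:           # Case 2.1: v was live until now
        v.suppressor_link = sender
        return ("Elimination-Successful",), sender
    # Case 2.2: v consults its most recent suppressor before switching.
    return ("Potential-Suppressor", i, uid), v.suppressor_link
```

The sketch stops at the point where v forwards “Potential-Suppressor”; handling the answer from w (Cases 2.2.1 and 2.2.2) would follow the same pattern.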


If phase(u) becomes $\lceil (n+2)/(2(r-1)f) \rceil + 1$ when u is live, then u elects itself as a leader. Node u then
sends the message “ANNOUNCE-LEADER-(ID(u))” to 2f+1 nodes and halts. All the nodes that
receive this message send the message “LEADER-IS-(ID(u))” on all the links incident on them and halt.
When a node receives “LEADER-IS-(j),” for some j, then the node halts. Since there are at most f
faulty links, and each faulty link is incident on at most two of the 2f+1 nodes, at least one node v adjacent
only to nonfaulty links receives “ANNOUNCE-LEADER-(ID(u)).” Thus, all the nodes in the network will
know ID(u) of the leader u and will halt.
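
The counting argument behind the announcement step can be checked mechanically. In the sketch below, the function name `clean_recipients` and the representation of a link as a pair of node names are assumptions made for illustration. A recipient is "clean" if no faulty link is incident on it, so it both receives the announcement and relays it reliably to every other node.

```python
def clean_recipients(recipients, faulty_links):
    """Return the recipients that are incident on no faulty link."""
    # Every faulty link touches exactly two nodes, so f links touch at most 2f.
    touched = {node for link in faulty_links for node in link}
    return [v for v in recipients if v not in touched]
```

With f = 2 and recipients 1 through 5, the faulty links (1,2) and (3,4) leave node 5 clean; no choice of two faulty links can touch all five recipients.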
3.4  Proof of Correctness


The  strategy  of  our  proof  of  correctness  will  be  as  follows.  Lemma  4.5  shows  that  no  two  distinct 
nodes  can  be  elected  as  leaders  in  the  same  execution.  Lemma  4.9  shows  that  there  exists  at  least  one 
leader  in  each  execution.  Hence,  Lemmas  4.5  and  4.9  imply  that  every  execution  produces  one  leader. 


Consider any execution E. Let t be a time, and u, v, and w be distinct nodes. Node u is a leader at t
if u is live at t and phase(u) becomes $\lceil (n+2)/(2(r-1)f) \rceil + 1$ at or before t. We say that u eliminates v at t if v sets
Suppressor-Link(v) to l(v,u) at t. Define Suppressors(v,t) as the empty sequence if v is live at t.
Otherwise, define Suppressors(v,t) as the sequence of all nodes, other than v, that eliminate v at times 0, 1,
..., t: u precedes w in Suppressors(v,t) if u eliminates v at some time $t_u$, w eliminates v at some time
$t_w$, and $0 \le t_u < t_w \le t$. Note that, by the total ordering on Events(v), $t_u \ne t_w$.


The following lemma shows that no node eliminates another node more than once in the same
execution.

Lemma 4.1: All the nodes in Suppressors(v,t) are distinct.

Proof: If u is in Suppressors(v,t), then, according to the algorithm, v sends “Elimination-Successful” to
u on l(v,u). Since u sends a refresher message only if u receives an out-of-date message, u does not send
any more eliminate-messages on l(u,v). □


We say that an eliminate-message M that v receives from u is a fatal message if M is the last
eliminate-message that u sends to v before u eliminates v. By Lemma 4.1, there is at most one fatal
message for each pair of nodes.

Lemma 4.2: Suppose that u precedes w in Suppressors(v,t). Suppose that u eliminates v at time $t_u < t$,
and that v receives the fatal message “Eliminate-(i,ID(w))” from w at some time $t_1$. Then, $t_u < t_1$.
Proof: Note that $t_u \ne t_1$ since Events(v) is totally ordered. Suppose that, to the contrary, $t_1 < t_u$.

If Suppressor-Link(v) = nil at $t_1$, then according to the algorithm, w is the first node in
Suppressors(v,t). This contradicts our hypothesis that u precedes w in Suppressors(v,t).

If Suppressor-Link(v) $\ne$ nil at $t_1$, then v sets ID(v) to ID(w) at $t_1$ and sends
“Potential-Suppressor-(i,ID(w))” on Suppressor-Link(v). Suppose that w eliminates v at some time $t_2$, where
$t_1 < t_2 \le t$. Then, by the algorithm, v receives “Potential-Suppressor-Successful-(i,ID(w))” at $t_2$. But
since u precedes w in Suppressors(v,t), then $t_u < t_2$. Thus ID(v) $\ne$ ID(w) at $t_2$ because $t_1 < t_u$.

According to the algorithm, v cannot set Suppressor-Link(v) to l(v,w) at $t_2$, a contradiction.

Hence $t_u < t_1$. □

Lemma 4.3 shows that the Algorithm Invariant indeed holds.

Lemma 4.3: Suppose that u eliminates v at t. Then, all the nodes except u in Suppressors(v,t) are dead
at t.

Proof: The proof proceeds by induction on the length of Suppressors(v,t), denoted |Suppressors(v,t)|.
Basis: Suppose that |Suppressors(v,t)| = 1. Then Suppressors(v,t) consists only of u, and the lemma is
trivially true.

Inductive Step: Suppose that the lemma is true for |Suppressors(v,t)| $\le p-1$ for some integer $p \ge 2$.
Consider the case when |Suppressors(v,t)| = p. Let Suppressors(v,t) = $w_1, w_2, \ldots, w_{p-1}, u$. By
Lemma 4.1, all the nodes in Suppressors(v,t) are distinct. Let the time when $w_{p-1}$ eliminates v be $t_w$.
Since u eliminates v after $w_{p-1}$ does, then $t_w < t$. Node v receives the fatal message
“Eliminate-(i,ID(u))” from u at some time $t_1 < t$, where i is some phase value. By Lemma 4.2, $t_w < t_1$. Thus v
receives the fatal message from u after v sets Suppressor-Link(v) to $l(v,w_{p-1})$. Therefore, v receives
M = “Potential-Suppressor-Successful-(i,ID(u))” from $w_{p-1}$ at t. By the algorithm, $w_{p-1}$ either eliminated
itself or had a suppressor when $w_{p-1}$ sent M. In either case, $w_{p-1}$ is dead at t. By the inductive hypothesis,
nodes $w_1, \ldots, w_{p-2}$ are all dead at $t_w$, and hence at t. □

Suppose that u is live when u sets phase(u) to an integer i at t. Then define Victims(u,i) as the set
of all nodes that u eliminates before or at t. If u never sets phase(u) to an integer j, or if u is dead when u
sets phase(u) to j, then we define Victims(u,j) as $\emptyset$.

Lemma 4.4: For every phase value i and nodes u and w, Victims(u,i) $\cap$ Victims(w,i) = $\emptyset$.
Proof: Suppose that, contrary to the lemma, there exist some u, v, w, and i, such that
$v \in$ Victims(u,i) $\cap$ Victims(w,i). Let $t_u$ be the time at which u sets phase(u) to i, $t_w$ be the time at which w
sets phase(w) to i, $t_{uv}$ be the time at which u eliminates v, and $t_{wv}$ be the time at which w eliminates v.
Without loss of generality, let $t_u \le t_w$. Since $v \in$ Victims(u,i) and $v \in$ Victims(w,i), then $t_{uv} \le t_u$ and
$t_{wv} \le t_w$. By the total ordering on Events(v), $t_{uv} \ne t_{wv}$. There are three cases, depending on the
relationships among $t_{uv}$, $t_{wv}$, and $t_u$:

Case 1: $t_{wv} < t_{uv}$: By Lemma 4.3, w is dead at $t_{uv}$. Since $t_{wv} < t_{uv} \le t_w$, then w is dead at $t_w$. Thus
Victims(w,i) = $\emptyset$, a contradiction.

Case 2: $t_{uv} < t_{wv} \le t_u$: By Lemma 4.3, u is dead at $t_{wv}$, and hence at $t_u$. Hence, Victims(u,i) = $\emptyset$, a contradiction.

Case 3: $t_{uv} \le t_u < t_{wv}$: By Lemma 4.3, u is dead at $t_{wv}$. Since Victims(u,i) $\ne \emptyset$, then u is live at $t_u$.
Therefore, Suppressor-Link(v) at $t_u$ is l(v,u). Suppose that t is the time at which v received the fatal
message M from w. By the total ordering on Events(v) and Lemma 4.2, $t_{uv} < t$. Since v sets
Suppressor-Link(v) to l(v,w) at $t_{wv} > t_u$, and u is live at $t_u$, then phase(w) in M must be at least equal to i.
Thus phase(w) is at least i at t. This contradicts the fact that w sets phase(w) to i at $t_w$.


Thus all three cases lead to contradictions. □

Lemma 4.5: For every execution E, there are no two leaders in E.

Proof: Suppose that, to the contrary, there are two leaders u and w in some E. Let
$i = \lceil (n+2)/(2(r-1)f) \rceil + 1$. Then, phase(u) and phase(w) become i in E. Thus
|Victims(u,i)|, |Victims(w,i)| $\ge (r-1)f \lceil (n+2)/(2(r-1)f) \rceil \ge (n+2)/2$ nodes. Since there are n nodes, there is at least
one node v in both Victims(u,i) and Victims(w,i). This contradicts Lemma 4.4. □
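
The counting step in this proof can be checked numerically. In the sketch below, the function names are hypothetical and the parameter `g` stands for the integer (r-1)f, the number of successful replies a live node needs to advance one phase. The check confirms that two disjoint victim sets of the required size never fit among n nodes.

```python
import math
from fractions import Fraction


def min_victims_of_leader(n, g):
    """Lower bound on |Victims(u, i)| at the final phase; g stands for (r-1)f."""
    increments = math.ceil(Fraction(n + 2, 2 * g))   # final phase minus 1
    return increments * g


def two_leaders_possible(n, g):
    """True only if two disjoint victim sets of this size fit among n nodes."""
    return 2 * min_victims_of_leader(n, g) <= n
```

For n = 20 and g = 3, each leader needs at least 12 victims, so two leaders would need 24 distinct victims out of 20 nodes.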


Let M be an eliminate-message that u sends on l(u,v). We define Path(M) as follows: If M is lost
on l(u,v), or if Suppressor-Link(v) = nil when v receives M, then Path(M) = {l(u,v)}. Otherwise,
Path(M) = {l(u,v), l(v,w)}, where l(v,w) = Suppressor-Link(v) when v receives M. Recall that New(u,i)
is the multiset of the eliminate-messages that a live node u sends on each link in Chosen-Links(u,i) when
phase(u) becomes i. We say that M is new if $M \in$ New(u,i), for some i.

The following lemma shows that the paths of any two distinct new messages sent by a node are
disjoint.

Lemma 4.6: Let M and M' be two distinct new messages sent by u. Then, Path(M) $\cap$ Path(M') = $\emptyset$.
Proof: When u sends a new message on a link l, u deletes l from Untrav(u). Hence, u sends M on some
l(u,v), while u sends M' on some l(u,w), where $v \ne w$. Clearly, Suppressor-Link(v) $\ne$ Suppressor-Link(w)
unless both suppressor-links are nil. □

If an event e in Events(u) occurs when phase(u) = i, then we say that e occurs during phase i. Let
$M_1$ be an eliminate-message sent by u on some l(u,v) during phase i. We say that a message $M_2$ is a
successful reply for $M_1$ if $M_2$ = “Elimination-Successful,” and if v sent $M_2$ on l(v,u) in response to $M_1$.
We say that $M_2$ is an unsuccessful reply for $M_1$ if $M_2$ = “Elimination-Unsuccessful-(i,ID(u)),” and if v
sent $M_2$ on l(v,u) in response to $M_1$. We say that $M_2$ is a reply for $M_1$ if $M_2$ is a successful or
unsuccessful reply for $M_1$.

According to the algorithm, an eliminate-message can be either new or a refresher for another
eliminate-message. By Lemma 4.6, if u sends two eliminate-messages on the same link, then at least one
of the messages is a refresher.

Lemma 4.7: Suppose that u sends $m \ge f$ eliminate-messages in the execution E. If no leader is elected in
E, then u receives at least $m - f$ replies for these messages.

Proof: Because there is no leader in E, every node continues to process messages. Thus a node u
receives no reply for an eliminate-message M only if Path(M) contains a faulty link. Let S(u) be the
multiset of the m eliminate-messages sent by u in the entire execution. Some of the eliminate-messages
in S(u) may be refresher messages. Construct the multiset S'(u) from S(u) as follows: An
eliminate-message $M \in$ S(u) sent to a node v is in S'(u) if and only if M is the last eliminate-message in S(u) that
was sent to v. Note that if M is in S(u) - S'(u), then u must have sent a refresher message for M, and thus u
received a reply for M.

Let |S'(u)| = m'. If $m' \le f$, then u received replies for at least $m - m' \ge m - f$ eliminate-messages, and
the lemma is true. If $f < m'$, then let $M_1$ and $M_2$ be any two distinct messages in S'(u). Suppose that u
sent $M_1$ to some node v. By the definition of S'(u), u must have sent $M_2$ to some node $w \ne v$. Thus, by
the definition of Path($M_1$) and Path($M_2$), Path($M_1$) $\cap$ Path($M_2$) = $\emptyset$. Hence at most f messages M in
S'(u) have a faulty link in Path(M), and at least $m' - f$ replies are received for messages in S'(u). The total
number of replies that u receives is at least $(m - m') + (m' - f) = m - f$. □

Suppose that u is live when u sets phase(u) to i at some time t, for some $2 \le i \le \lceil (n+2)/(2(r-1)f) \rceil$. Then
define Old(u,i) as the multiset of the eliminate-messages that u sent and for which u did not receive a
reply by t. Define Old(u,1) as $\emptyset$. Recall that, if u is live, then u sends a refresher message as soon as u
receives an out-of-date message.

Lemma 4.8: Let u be live when u sets phase(u) to $i \ge 2$. Then, |Old(u,i)| = f. Hence u receives at most
f out-of-date messages during phase i.

Proof: The proof proceeds by induction on i.


Basis: Suppose that i = 2. Node u sent rf new messages during phase 1 and received (r-1)f successful
replies for them. Hence the lemma is true.


Inductive Step: Assume that the lemma holds for $i = p-1$, for some $3 \le p \le \lceil (n+2)/(2(r-1)f) \rceil$. Suppose that
$i = p$. During phase $p-1$, u sent (r-1)f new messages and received replies for k messages in Old(u,p-1),
where $0 \le k \le f$. Suppose that, during phase $p-1$, u received successful replies for $k_1$ messages in
Old(u,p-1), where $0 \le k_1 \le k$. Then u received unsuccessful replies for $k - k_1$ messages in Old(u,p-1)
during phase $p-1$. Since u is live in phase p, all these unsuccessful replies were of the form
“Elimination-Unsuccessful-(j,ID(u)),” where $j < p-1$. Thus all of the unsuccessful replies that u
received in phase $p-1$ were out-of-date messages. Hence u sent $k - k_1$ refresher messages during phase
$p-1$. Thus u sent $(r-1)f + k - k_1$ eliminate-messages during phase $p-1$. Since u advanced to phase p, u
must have received successful replies for $(r-1)f - k_1$ eliminate-messages sent during phase $p-1$. Hence
Old(u,p) is equal to the union of [the multiset of eliminate-messages in Old(u,p-1) for which u did not
receive replies during phase $p-1$] and [the multiset of the eliminate-messages sent during phase $p-1$ for
which u did not receive replies during phase $p-1$]. Hence,

$|Old(u,p)| = [f - k] + [(r-1)f + k - k_1 - ((r-1)f - k_1)] = f$. □

Lemma 4.9: There exists at least one leader in the execution E.

Proof: Suppose that, to the contrary, no leader exists. Thus, for every node v, phase(v) < $\lceil (n+2)/(2(r-1)f) \rceil + 1$.

Since no node can decrease its phase, there exists some time $t_f$ such that no node changes its phase after
$t_f$. At $t_f$, let (i,j) be the highest (phase,ID)-pair among the nodes. Let u be the unique node whose
identifier was j at the start of the algorithm. Then phase(u) is i at $t_f$. To see this, let v be any node whose
(phase,ID)-pair is (i,j) at $t_f$. If v = u, then phase(u) is i at $t_f$. Suppose $v \ne u$. Node v is dead at $t_f$ since
ID(v) = j at $t_f$. Thus, v must have received “Eliminate-(i,j)” at or before $t_f$. By the algorithm, only u
can send “Eliminate-(i,j).” Hence, by the choice of (i,j), and since no node can decrease its phase,
phase(u) is i at $t_f$. Note also that u is live at $t_f$. To see this, let $t_u$ be the time when u sets phase(u) to i
for the first time, where $t_u \le t_f$. At $t_u$, u is either dead or live. Since u changes phase(u) to i at $t_u$, if
u is dead at $t_u$, then u must change ID(u) according to the algorithm. Thus ID(u) $\ne$ j at $t_u$. By the
algorithm, u never sends eliminate-messages after u is dead. Thus u never sends “Eliminate-(i,j).” But
this means that (i,j) is not a (phase,ID)-pair of any node, a contradiction. Therefore, u is live at $t_u$. By the
choice of (i,j), and since the (phase,ID)-pair of any node cannot decrease, no node can eliminate u
after $t_u$. Thus u is live at $t_f$.

To prove the lemma, there are two cases, depending on the value of i:

Case 1: $i \ge 2$. Suppose that u sets phase(u) to i at some time $t_u \le t_f$. After $t_u$, u sends (r-1)f new
messages. Because there is no leader in E, u must receive replies for all the messages M in Old(u,i) and
in New(u,i), where Path(M) contains no faulty links. If M is in New(u,i), then the reply for M is a
successful reply since u is live at $t_f$. If M is in Old(u,i), then the reply for M can be an out-of-date
message. Since E has no leader, u has not halted, and u must send a refresher message for M. By
Lemma 4.7, at most f messages in New(u,i) $\cup$ Old(u,i) are lost. Hence, u receives at least
$(r-1)f + |Old(u,i)| - f$ “Elimination-Successful” messages during phase i, all with nonfaulty paths.

Hence, by Lemma 4.8, u receives at least (r-1)f successful reply messages during phase i. Thus u must
increment phase(u) by 1 some time after $t_u$. This contradicts our assumption that phase(u) = i at $t_f$.

Case 2: i = 1. This case is very similar to the previous case, except that u sends rf new messages during
phase 1, and that |Old(u,1)| = 0. This case also leads to a contradiction. □

Theorem 4.1: For every execution E of the algorithm, there exists a unique leader. Furthermore, every
node in the network knows the ID of the leader.

Proof: By Lemmas 4.5 and 4.9, there exists a unique leader u in E. Furthermore, u sends
“ANNOUNCE-LEADER-(ID(u))” to 2f+1 nodes. Since there are at most $f \le \lfloor n/2 - 3 \rfloor$ faulty links, there
exists a node v that is not adjacent to any faulty links and that receives “ANNOUNCE-LEADER-(ID(u)).”
According to the algorithm, v sends ID(u) to all adjacent nodes. Thus all the nodes will know
the ID of the leader. □


3.5  Message  Complexity 


The following lemma specifies the maximum number of live nodes that reach phase i.


Lemma 5.1: Let i be an integer such that $2 \le i \le \lceil (n+2)/(2(r-1)f) \rceil + 1$. For every i, there are at most
$\lfloor n/((i-1)(r-1)f) \rfloor$ nodes u such that u is live when u sets phase(u) to i.

Proof: Suppose that there are k nodes u such that u is live when u sets phase(u) to i. Trivially,
$k \le n$. Let $u_j$ denote the j-th live node to set its phase to i, for every $1 \le j \le k$. By induction on i, we can
show that $u_j$ must have eliminated at least $(i-1)(r-1)f$ nodes to reach phase i. By Lemma 4.4,
Victims($u_j$,i) $\cap$ Victims($u_p$,i) = $\emptyset$ for every $1 \le j \ne p \le k$. Hence $k \le \lfloor n/((i-1)(r-1)f) \rfloor$. □

Theorem 5.1: The algorithm uses $O(nrf + \frac{nr}{r-1} \log \frac{n}{(r-1)f})$ messages in the worst case, where n is
the number of nodes in the network, and f is the maximum number of faulty links.

Proof: Let $i < \lceil (n+2)/(2(r-1)f) \rceil + 1$ be an integer. If a node u is live when it sets phase(u) to i at some time t,
then u sends (r-1)f new messages during phase i. By Lemma 4.8, u sends at most f refresher messages
during phase i. Hence u sends at most rf eliminate-messages during phase i. Each eliminate-message
generates at most three additional messages as follows:


(1)  Suppose that u sends an eliminate-message to a node v.

(2)  Node v sends “Potential-Suppressor-(i,ID(u))” on Suppressor-Link(v).

(3)  Node v receives “Potential-Suppressor-Successful-(i,ID(u))” or
“Potential-Suppressor-Unsuccessful-(i,ID(u))” on Suppressor-Link(v).

(4)  Finally, v sends a reply to u.


Thus u causes at most 4rf messages to be generated while phase(u) is i. The total number of
messages the algorithm uses is the sum of the number of messages generated to elect a leader plus the
number of messages used to inform all the nodes of the ID of the leader. By Lemma 5.1, the number
NUM of the messages generated to elect a leader is as follows:

$$\mathrm{NUM} \le \sum_{i=1}^{\lceil (n+2)/(2(r-1)f) \rceil} (\text{num. live nodes } u \text{ that reach phase } i)\,(\text{num. messages generated by } u \text{ in phase } i)$$

$$\le \sum_{i=2}^{\lceil (n+2)/(2(r-1)f) \rceil} \left\lfloor \frac{n}{(i-1)(r-1)f} \right\rfloor 4rf + 4nrf \;\le\; \frac{4nr}{r-1} \sum_{i=2}^{\lceil (n+2)/(2(r-1)f) \rceil} \frac{1}{i-1} + 4nrf$$

$$\mathrm{NUM} = O\left(\frac{nr}{r-1} \log \frac{n}{(r-1)f} + nrf\right)$$

The algorithm uses $(2f+1)n = O(nf)$ messages to inform all the nodes of the ID of the leader. □
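
To see how the two terms of the bound behave for concrete parameters, the sum in the proof can be evaluated directly. The sketch below is illustrative: the function names are hypothetical, rf is used in place of $\lceil rf \rceil$ as in the text, and r is handled as an exact `Fraction`. It evaluates the phase-by-phase bound on NUM and then adds the (2f+1)n announcement messages.

```python
import math
from fractions import Fraction


def election_message_bound(n, r, f):
    """Evaluate 4nrf + sum over phases i >= 2 of floor(n/((i-1)(r-1)f)) * 4rf."""
    r = Fraction(r)
    g = (r - 1) * f                              # (r-1)f successes per phase
    last = math.ceil(Fraction(n + 2) / (2 * g))  # last phase in which nodes send
    total = 4 * n * r * f                        # phase 1: all n nodes are live
    for i in range(2, last + 1):
        live = math.floor(n / ((i - 1) * g))     # bound from Lemma 5.1
        total += live * 4 * r * f
    return total


def total_message_bound(n, r, f):
    """Add the (2f+1)n messages used to announce the leader."""
    return election_message_bound(n, r, f) + (2 * f + 1) * n
```

For n = 20, r = 2, and f = 3, the phases contribute 480 + 144 + 72 + 48 = 744 messages to the election bound, and the announcement adds 140 more.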


A  detailed  analysis,  omitted  here,  shows  that  the  value  of  r  that  minimizes  the  total  number  of 

messages  is  r  =  min(  1  +C(  )1;2  ,  ~~  ),  where  C  =  l+0(  !°^^  ).  For  every  value  of  n  and  / 

/  2/  log  n 

subject  to/£|_  3j  ,  y  £  C  £  1.  Thus  the  minimum  number  of  messages  that  our  algorithm  uses  is 

2  o 


O(nf + n log n) messages.
3.6 Time Complexity

Theorem 6.1: The algorithm takes at most $O\left(\frac{n}{(r-1)f}\right)$ time units to complete, where n is the number of
nodes, and f is the maximum number of faulty links.

Proof: The maximum running time of the algorithm is the maximum time spent to elect a leader in any
execution. Consider any execution E, and let u be the leader when E terminates. Assume the delay on
each reliable link to be at most one time unit. Let M be any eliminate-message that u sends. As in the
proof of Theorem 5.1, M generates at most three additional messages. Each message reaches its
destination in at most one time unit. Hence u receives a reply for M in at most 4 time units. Thus, by
induction on i, u spends at most 4(i-1) time units to reach phase i. In particular, u spends at most
$4\lceil \frac{n+2}{2(r-1)f} \rceil$ time units to elect itself as the leader. By the algorithm, all the nodes will know the ID of
the leader in at most two more time units. Since all the nodes start executing the algorithm
simultaneously, the algorithm will terminate in at most $4\lceil \frac{n+2}{2(r-1)f} \rceil + 2 = O\left(\frac{n}{(r-1)f}\right)$ time units. □
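
A small worked instance of the running-time bound may help. The sketch below simply evaluates 4*ceil((n+2)/(2(r-1)f)) + 2 for given parameters; the function name `time_bound` is hypothetical, and the formula is our reading of the expression in the proof.

```python
def time_bound(n: int, r: int, f: int) -> int:
    """Evaluate 4 * ceil((n+2) / (2(r-1)f)) + 2 in integer arithmetic.

    The first term bounds the time for the eventual leader to pass through
    all phases (at most 4 time units per phase); the final 2 accounts for
    announcing the leader's ID.
    """
    phases = -(-(n + 2) // (2 * (r - 1) * f))  # integer ceiling
    return 4 * phases + 2
```

For example, with n = 1000, r = 3, and f = 5, the bound evaluates to 4 * 51 + 2 = 206 time units.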

3.7  Formal  Description  of  the  Algorithm 

We assume in what follows that the node IDs are integers. The algorithm for each node u uses the
following variables and data structures:

* LINK(u) is the set of the names of all links incident on u.

* UNTRAV(u) is a set of link names. Initially, UNTRAV(u) = LINK(u). (Link l ∈ UNTRAV(u) iff l is
incident on u and u has not sent any eliminate-message on l.)

* phase(u) is an integer variable. Initially, phase(u) = 0.

* dead(u) is a Boolean variable. Initially, dead(u) is false.

* num_of_victims(u) is an integer variable. Initially, num_of_victims(u) = 0. (This variable is a lower
bound on the number of nodes that u eliminated during the current phase.)

* SUPPRESSOR_LINK(u) is a pointer to a link. Initially, SUPPRESSOR_LINK(u) = l(u,u).

* POTENTIAL_SUPPRESSOR_ARRAY_u is an integer array indexed by the names of the links incident
on u. Initially, POTENTIAL_SUPPRESSOR_ARRAY_u[l] = nil for each l in LINK(u). (Intuitively,
POTENTIAL_SUPPRESSOR_ARRAY_u points to the nodes that are potential suppressors of u.)

* leader(u) is an integer variable. Initially, leader(u) = nil.

The  algorithm  for  each  u  is  as  follows: 

(The  comments  refer  to  the  cases  in  Section  3.3.2.) 


begin

    Set phase(u) to 1;

    Choose rf links from UNTRAV(u);

    Call the chosen links e_1, e_2, ..., e_rf;
    UNTRAV(u) := UNTRAV(u) - {e_1, e_2, ..., e_rf};

    Send "ELIMINATE_(1,ID(u))" on each e_i, where i = 1, 2, ..., rf;

    while phase(u) < ⌈(n+2)/(2(r-1)f)⌉ + 1 or dead(u)


    begin

    Receive some message M on some link l;

    case M of:

    (1) M is "ELIMINATE_(k,j)" and SUPPRESSOR_LINK(u) = nil:

        /* Case 1 */

        if (k,j) < (phase(u),ID(u)) then Send "ELIMINATION_UNSUCCESSFUL_(k,j)" on l;

        /* Case 2.1 */

        if (k,j) > (phase(u),ID(u)) then
        begin
            SUPPRESSOR_LINK(u) := l;
            dead(u) := true;
            (phase(u),ID(u)) := (k,j);
            Send "ELIMINATION_SUCCESSFUL" on l
        end if;


    (2) M is "ELIMINATE_(k,j)" and SUPPRESSOR_LINK(u) ≠ nil:

        /* Case 1 */

        if (k,j) < (phase(u),ID(u)) then Send "ELIMINATION_UNSUCCESSFUL_(k,j)" on l;

        /* Case 2.2 */

        if (k,j) > (phase(u),ID(u)) then
        begin
            POTENTIAL_SUPPRESSOR_ARRAY_u[l] := j;
            (phase(u),ID(u)) := (k,j);
            Send "POTENTIAL_SUPPRESSOR_(k,j)" on SUPPRESSOR_LINK(u)
        end if;

        /* Case 1.2 */

    (3) M is "ELIMINATION_UNSUCCESSFUL_(k,j)" and not dead(u):

        if phase(u) = k then dead(u) := true;

        if phase(u) ≠ k then Send "ELIMINATE_(phase(u),ID(u))" on l;

    (4) M is "ELIMINATION_SUCCESSFUL" and not dead(u):

        num_of_victims(u) := num_of_victims(u) + 1;

        /* increment phase(u) */

        if num_of_victims(u) = (r-1)f then
        begin
            phase(u) := phase(u) + 1;

            if phase(u) < ⌈(n+2)/(2(r-1)f)⌉ + 1
            then begin
                Set num_of_victims(u) to 0;
                Choose (r-1)f links from UNTRAV(u);
                Call these links e_1, e_2, ..., e_(r-1)f;
                UNTRAV(u) := UNTRAV(u) - {e_1, ..., e_(r-1)f};
                Send "ELIMINATE_(phase(u),ID(u))" on each e_i
            end if;
        end if;


    (5) M is "POTENTIAL_SUPPRESSOR_(k,j)":

        /* Case 2.2.1 */

        if (k,j) < (phase(u),ID(u)) then Send "POTENTIAL_SUPPRESSOR_UNSUCCESSFUL_(k,j)" on l;

        /* Case 2.2.2 */

        if (k,j) > (phase(u),ID(u)) then
        begin
            dead(u) := true;
            Send "POTENTIAL_SUPPRESSOR_SUCCESSFUL_(k,j)" on l
        end if;


        /* Case 2.2.1 */

    (6) M is "POTENTIAL_SUPPRESSOR_UNSUCCESSFUL_(k,j)":

        Find the link l' such that POTENTIAL_SUPPRESSOR_ARRAY_u[l'] = j;
        Send "ELIMINATION_UNSUCCESSFUL_(k,j)" on l';

        /* Case 2.2.2 */

    (7) M is "POTENTIAL_SUPPRESSOR_SUCCESSFUL_(k,j)":

        Find the link l' such that POTENTIAL_SUPPRESSOR_ARRAY_u[l'] = j;
        if j ≠ ID(u) then Send "ELIMINATION_UNSUCCESSFUL_(k,j)" on l';
        if j = ID(u) then
        begin
            Send "ELIMINATION_SUCCESSFUL" on l';
            SUPPRESSOR_LINK(u) := l'
        end if;




        /* Case 1.1 */

    (8) M is "ELIMINATION_SUCCESSFUL" or "ELIMINATION_UNSUCCESSFUL_(k,j)"
        and dead(u):

        Discard M;

    (9) M is "ANNOUNCE_LEADER_(j)":

        dead(u) := true;
        Send "LEADER_IS_(j)" on each link in LINK(u);
        leader(u) := j;
        halt algorithm.

    (10) M is "LEADER_IS_(j)":

        dead(u) := true;
        leader(u) := j;
        halt algorithm.


end  while; 


    /* u elects itself as a leader */

    Choose 2f+1 links from LINK(u);

    Send "ANNOUNCE_LEADER_(ID(u))" on each chosen link;
    leader(u) := ID(u);

    halt algorithm. □
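
To make the pair comparisons in the case analysis concrete, here is a minimal Python sketch of case (1), in which an ELIMINATE message arrives at a node that has no suppressor link. It assumes that the pairs (k, j) and (phase(u), ID(u)) are compared lexicographically, which is how Python compares tuples. The names `Node` and `on_eliminate` are hypothetical, and nil is represented by `None`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    node_id: int                       # ID(u); overwritten when u is suppressed
    phase: int = 1                     # phase(u)
    dead: bool = False                 # dead(u)
    suppressor_link: Optional[str] = None  # SUPPRESSOR_LINK(u), None = nil


def on_eliminate(u: Node, k: int, j: int, link: str) -> Optional[Tuple[str, str]]:
    """Handle "ELIMINATE_(k,j)" at a node with no suppressor.

    Returns (reply, link to send it on), or None if the pairs are equal,
    a situation the pseudocode does not address.
    """
    if (k, j) < (u.phase, u.node_id):      # Case 1: the sender is weaker
        return ("ELIMINATION_UNSUCCESSFUL", link)
    if (k, j) > (u.phase, u.node_id):      # Case 2.1: the sender is stronger
        u.suppressor_link = link
        u.dead = True
        u.phase, u.node_id = k, j          # u adopts the sender's pair
        return ("ELIMINATION_SUCCESSFUL", link)
    return None
```

Because the phase is the first component of the pair, a node in a later phase always beats a node in an earlier phase, and IDs only break ties within a phase.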
CHAPTER 4

UPPER  AND  LOWER  BOUNDS 
FOR  ELECTION  IN  SYNCHRONOUS  SQUARE  MESHES 

4.1  Introduction 

Many election algorithms for various synchronous networks have been proposed in the literature.
Frederickson and Lynch [22] showed that Ω(n log n) messages are necessary for election in synchronous
rings when the election algorithm is required to be a comparison algorithm, i.e., when the election
algorithm uses only comparisons of processor identifiers. On the other hand, the comparison algorithm of
Loui, Matsushita, and West [35], originally designed for asynchronous complete networks, uses less than
4n messages in the worst case on synchronous complete networks. A natural question arises, therefore,
about whether there exists a comparison algorithm on a bounded degree network using O(n) messages.
Peterson [41] answered this question in the affirmative when he designed a comparison algorithm for
asynchronous square meshes that uses about 90n messages in the worst case. When run on synchronous
square meshes, the message complexity of Peterson's algorithm becomes 32n. Although Peterson's
algorithm is of theoretical importance, it is not practical because of the large constants in the message
complexity. In this chapter we present a comparison algorithm for square meshes that uses at most (229/18)n
messages, runs in O(√n) time units, and requires O(log |T|) bits per message, where n is the
number of processors in the mesh, and |T| is the cardinality of the set of processor identifiers. Also, we

57 

prove  that  any  comparison  algorithm  on  synchronous  meshes  requires  at  least  —  n  messages.  The  lower 
bound  holds  a  fortiori  for  asynchronous  meshes. 

4.2 Model

We define an n-mesh as a square of n processors, with √n processors on each side, but with each
column forming a ring and each row forming a ring. (Hwang and Briggs [29] call our mesh a
near-neighbor mesh.) Figure 4.1 shows a 9-mesh. Two processors in an n-mesh communicate by sending
messages to each other either on the link joining them or via other processors. We assume a synchronous
mode of communication. This implies that there is an upper limit on link delays before messages are
delivered. We take the delay on each link to be at most one time unit.

Figure 4.1. A 9-mesh

We  assume  that  all  the  processors  in  the  n-mesh  are  identical  except  that  each  processor  p  has  a 
unique  identifier  ID  (p)  chosen  from  a  totally  ordered  set.  Initially,  no  processor  knows  the  identifier  of 
any  other  processor.  We  assume  that  the  processors  know  that  the  network  is  a  square  mesh,  but  they  do 
not  know  the  value  of  n.  We  also  assume  that  the  processors  have  a  sense  of  direction  [45].  Informally, 
sense  of  direction  means  that  the  processors  have  a  uniform  notion  of  which  of  their  four  links  is  the  east 
link,  the  south  link,  the  west  link,  and  the  north  link.  Thus,  for  example,  if  processor  p  sends  a  message 
M  on  p’s  east  link,  then  processor  q  that  receives  M  knows  that  q  received  M  from  the  west. 

A distributed algorithm for an n-mesh is a set of n identical programs; each program is assigned to a
processor. We model each processor p as an automaton. An election algorithm specifies the following
for p:

(1) the set of states S_p. The initial state of p is p's identifier ID(p).

(2) a subset L_p of S_p called the set of elected states. If p is in a state in L_p, then p is elected as a leader.

(3) a message generation function that maps each state s of p to a 4-tuple (M_E, M_S, M_W, M_N), where each
element of the 4-tuple is either a message or the atom nil. Let p_E (respectively p_S, p_W, p_N) denote
p's east (respectively south, west, north) neighbor. If M_E (respectively M_S, M_W, M_N) is not nil, then
M_E (respectively M_S, M_W, M_N) represents the message that p sends to p_E (respectively p_S, p_W, p_N)
when p is in state s. If M_E (respectively M_S, M_W, M_N) is nil, then p does not send a message to p_E
(respectively p_S, p_W, p_N) when p is in state s.

(4) a transition function that maps the 4-tuple (M_E, M_S, M_W, M_N) and p's current state s to p's next state,
where each element of the 4-tuple is either a message or the atom nil. If M_E (respectively M_S, M_W,
M_N) is nil, then p did not receive a message from p_E (respectively p_S, p_W, p_N) when p is in state s.
We assume that if p's current state is in L_p, then p's next state is in L_p. In other words, L_p is a
closed set.

In this chapter, the n-mesh is synchronous. Thus election algorithms on the mesh proceed in
rounds.  In  each  round,  each  processor  p  sends  some  messages  according  to  the  message  generation 
function  and  p’s  current  state,  receives  the  messages  sent  to  p  in  the  current  round,  does  some  internal 
computations,  and  changes  state  according  to  the  transition  function. 
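
The round structure described above can be sketched in a few lines of Python. This is an illustration under our own conventions, not the thesis's formalism: processors are (row, column) pairs on a torus, directions are indexed east = 0, south = 1, west = 2, north = 3, nil is `None`, and the names `neighbors` and `run_round` are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

Coord = Tuple[int, int]  # (row, column); rows grow to the south
Tuple4 = Tuple[Optional[int], Optional[int], Optional[int], Optional[int]]


def neighbors(p: Coord, side: int) -> Tuple[Coord, Coord, Coord, Coord]:
    """East, south, west, and north neighbors of p on a side x side mesh
    whose rows and columns are rings."""
    r, c = p
    return ((r, (c + 1) % side), ((r + 1) % side, c),
            (r, (c - 1) % side), ((r - 1) % side, c))


def run_round(states: Dict[Coord, int],
              generate: Callable[[int], Tuple4],
              transition: Callable[[Tuple4, int], int],
              side: int) -> Dict[Coord, int]:
    """One synchronous round: every processor sends, then every processor
    applies the transition function to what it received."""
    inbox = {p: [None, None, None, None] for p in states}
    for p, s in states.items():
        out = generate(s)
        for d, q in enumerate(neighbors(p, side)):
            if out[d] is not None:
                # A message sent east is received from the west, and so on.
                inbox[q][(d + 2) % 4] = out[d]
    return {p: transition(tuple(inbox[p]), s) for p, s in states.items()}
```

As a usage example (a simple flooding rule, not the election algorithm of this chapter), if every processor sends its state in all four directions and keeps the maximum of what it has seen, then two rounds suffice on a 9-mesh for every processor to hold the largest initial state.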

We  wish  the  processors  to  execute  an  election  algorithm  so  that  a  unique  processor  is  chosen  as  the 
leader  of  the  network  when  the  algorithm  terminates. 

4.3 Upper Bounds

We now present an algorithm for election in synchronous n-meshes. All the processors start the
algorithm simultaneously.

4.3.1 Overview of the algorithm

The  algorithm  proceeds  in  phases.  Each  processor  tries  to  eliminate  all  other  processors  from  the 
competition  to  be  the  leader.  The  processor  with  the  largest  identifier  will  be  elected  as  the  leader.  We 
say  that  processor  u  is  live  at  the  start  of  phase  i  if  u  has  not  been  eliminated  before  the  start  of  phase  i.  If 
u  is  not  live,  then  u  is  dead.  A  dead  processor  does  not  initiate  any  messages,  although  it  may  relay 
messages. Each processor u keeps a variable called largest-seen(u) that contains the largest identifier
that u is aware of. Initially, largest-seen(u) = ID(u).

Before we give the details of the algorithm, we present the intuition behind it. Define a q-square of
processors to be a square mesh of processors with q+1 processors on each side but without wrap-around
connections. As in Figure 4.2, let SE(u,q) denote the q-square with processor u at the northwest corner.
When there is no confusion, we let SE(u,q) denote also the set of all processors that are contained in or
on the boundary of square SE(u,q). Let SW(u,q) denote the q-square with processor u at the northeast
corner. Define NW(u,q) and NE(u,q) similarly. The algorithm uses four kinds of messages. We will
explain their use as we explain the algorithm. The messages are

(1) Eliminate messages of the form "Eliminate_[ID_1, k, ID_2]," where ID_1 is the identifier of the
processor that initiates the message, k is the number of links that the message will traverse in the
current direction, and ID_2 is an identifier of another processor;

(2) Kill messages of the form "Kill_[ID]," where ID is an identifier of a processor;

(3) Mark messages of the form "Mark_[ID, k]," where ID is an identifier of a processor, and k is the
number of links that the message will traverse in the current direction;

(4) Final messages of the form "Final_[ID]," where ID is the identifier of the processor that initiates
the message.
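
The four message forms listed above map naturally onto small record types. The following Python sketch is one possible encoding; the class and field names (`initiator`, `hops_left`, `carried`, `target`) and the helper `forward` are hypothetical, and nil is represented by `None`.

```python
from dataclasses import dataclass, replace
from typing import Optional, Union


@dataclass(frozen=True)
class Eliminate:
    initiator: int           # ID_1: the processor that initiated the message
    hops_left: int           # k: links still to traverse in this direction
    carried: Optional[int]   # ID_2: another processor's identifier, or nil


@dataclass(frozen=True)
class Kill:
    target: int              # identifier of a processor


@dataclass(frozen=True)
class Mark:
    initiator: int           # identifier of a processor
    hops_left: int           # k: links still to traverse in this direction


@dataclass(frozen=True)
class Final:
    initiator: int           # the processor that initiated the message


def forward(msg: Union[Eliminate, Mark]) -> Union[Eliminate, Mark]:
    """Return a copy of msg with one fewer link left in the current direction."""
    return replace(msg, hops_left=msg.hops_left - 1)
```

Each message carries at most two identifiers and one hop counter, which is in line with the stated goal of a small number of bits per message.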

For the rest of the chapter, we use the letter E to denote Eliminate messages, K to denote Kill messages,
M to denote Mark messages, and F to denote Final messages. We let E_u denote any message of the form
"Eliminate_[ID(u), k, ID]," where ID(u) is the identifier of the processor that initiates the message. We
also let M_u denote any message of the form "Mark_[ID(u), k]," and F_u denote any message of the form
"Final_[ID(u)]." The algorithm has four main ideas.

Figure 4.2. The definitions of NW(u,q), NE(u,q), SE(u,q), and SW(u,q)

Let α > 1 be a parameter to be optimized later. The first idea is that, at the start of each phase i,
each live processor u sends an Eliminate message E_u to traverse clockwise the boundary of SE(u, α^i). If
E_u completes the traversal of the boundary of SE(u, α^i), and if u is live when u receives E_u, then u
advances to phase i+1. Suppose that v is a processor on the boundary of SE(u, α^i), and that v receives E_u,
which contains ID(u). If largest-seen(v) ≤ ID(u), then v is eliminated, largest-seen(v) becomes
ID(u), and E_u continues the traversal of the boundary of SE(u, α^i). If largest-seen(v) > ID(u), then v
discards E_u. Since the network is synchronous, if u does not receive E_u after 4α^i time units, then u knows
that E_u was discarded, and u becomes eliminated.
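
The clockwise traversal in the first idea can be illustrated with a short Python sketch. It assumes (row, column) coordinates with rows growing to the south, an integer side length q standing in for α^i, and a hypothetical function name `boundary_walk`; none of these conventions come from the thesis.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (row, column); rows grow to the south


def boundary_walk(u: Coord, q: int, side: int) -> List[Coord]:
    """Processors visited, in order, by a message that leaves u and goes
    clockwise around the boundary of SE(u, q): q links east, q links south,
    q links west, and q links north, on a side x side mesh with wrap-around."""
    r, c = u
    path = []
    for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
        for _ in range(q):
            r, c = (r + dr) % side, (c + dc) % side
            path.append((r, c))
    return path
```

The walk has 4q steps and ends at u, so with a delay of at most one time unit per link, a message that is never discarded returns within 4q time units. This is why u can treat the absence of its message after that time as evidence that the message was discarded.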

We can show that the first idea guarantees that if u advances to phase i+1, then all the processors in
SE(u, α^i) ∪ NW(u, α^i) - {u} are eliminated by the time phase i+1 starts. Our second and third ideas
will guarantee that all processors in SW(u, α^i) ∪ NE(u, α^i) - {u} are also eliminated by the time phase
i+1 starts. Before we present our second idea, observe that if w is a processor in SW(u, α^i), then
SE(w, α^i) ∩ SE(u, α^i) contains a processor v_w on the west boundary of SE(u, α^i) and a processor s_w on
the south boundary of SE(u, α^i), where v_w may be the same as s_w (i.e., may be the southwest corner of
SE(u, α^i)). See Figure 4.3. If w is live at the start of phase i, then w's Eliminate message reaches s_w
before u's Eliminate message does. Our second idea is that, when traversing the south boundary of
SE(u, α^i), u's Eliminate message E_u can determine the processor w_0 with the largest identifier in
SW(u, α^i) whose E_{w_0} reaches the south boundary of SE(u, α^i). Thus when E_u reaches the processor v_{w_0}
on the west boundary of SE(u, α^i), then v_{w_0} sends a Kill message to w_0, and E_u continues the traversal of
the boundary of SE(u, α^i). Processor w_0 is eliminated as soon as w_0 receives a Kill message. Note that
E_u spawns at most one Kill message in phase i.

Figure 4.3. Processor w in SW(u, α^i)

Our third idea is as follows. Suppose that an Eliminate message E_u successfully traverses the north
boundary of SE(u, α^i) in phase i, but that a processor v on the east or south boundary of SE(u, α^i)
discards E_u because ID(u) < largest-seen(v). Then v sends a Mark message M_u that contains ID(u) to
traverse the remainder of the east and south boundary of SE(u, α^i). M_u makes certain that
largest-seen(v') is at least ID(u) for each processor v' on the east and south boundary of SE(u, α^i).

Thus when v' receives M_u, processor v' compares ID(u) with largest-seen(v'). If
ID(u) ≤ largest-seen(v'), then M_u continues its traversal. If largest-seen(v') < ID(u), then v' is
eliminated, v' sets largest-seen(v') to ID(u), and v' relays M_u. The processor on the southwest corner
of SE(u, α^i) discards M_u. Similarly, if a processor v'' on the north boundary of SE(u, α^i) discards E_u,
then v'' sends a Mark message M_u that contains ID(u) to traverse the remainder of the north boundary of
SE(u, α^i). The processor on the northeast corner of SE(u, α^i) discards M_u. (The necessity for
distinguishing between the v and v'' cases will become clear in Lemma 5.11.)

We will show that our second and third ideas are sufficient to guarantee that all processors in
SW(u, α^i) ∪ NE(u, α^i) - {u} are eliminated when phase i+1 starts.

For our fourth idea, let E_u be an Eliminate message that is not discarded in phase i* = ⌈log_α √n⌉.
Then E_u loops around the row containing processor u and reaches u from the west. We will prove that in
phase i*, there is at most a constant number of processors that receive their Eliminate messages from the
west. Thus for phase i*+1, the live processors execute a simple algorithm that guarantees the uniqueness
of the leader. At the start of phase i*+1, u sends to the north a Final message F_u that contains ID(u).

Let z be the processor with the largest identifier in the network. In the proof of correctness, we will prove
that z executes phase i*+1. Thus if F_u reaches a processor z' in the row Row(z) containing z, then
ID(u) < ID(z) = largest-seen(z'), and z' discards F_u. On the other hand, F_z traverses the entire column
containing z. When z receives F_z from the south, then z knows that z has the largest identifier in the
network, and that all Final messages F_u with u ≠ z will be discarded when they encounter Row(z). Thus
z elects itself as the unique leader.
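
The fourth idea can be illustrated with two small Python helpers. Both are sketches under stated assumptions: the first reads i* as the smallest i with α^i ≥ √n and takes α to be an integer; the second assumes that a processor whose largest-seen value equals ID(u) passes F_u on, a case the overview leaves implicit. The function names are hypothetical.

```python
from typing import List


def final_phase_index(side: int, alpha: int) -> int:
    """Smallest i such that alpha**i >= side, where side = sqrt(n).

    In that phase an Eliminate message sent east can wrap around the
    whole row and return to its initiator from the west.
    """
    i, reach = 0, 1
    while reach < side:
        reach *= alpha
        i += 1
    return i


def final_message_returns(uid: int, largest_seen_north: List[int]) -> bool:
    """True iff the Final message F_u, sent north by the processor with
    identifier uid, travels around the whole column and returns.

    largest_seen_north lists the largest-seen values of the other
    processors in the column, in the order F_u meets them. A processor
    discards F_u when its largest-seen value exceeds uid.
    """
    return all(seen <= uid for seen in largest_seen_north)
```

For the processor z with the largest identifier, no largest-seen value can exceed ID(z), so its Final message always returns; any other Final message is discarded no later than Row(z).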

4.3.2 Details of the algorithm

Each processor u keeps a variable called largest-seen(u,i). Intuitively, largest-seen(u,i) is the
largest identifier in an Eliminate message that u receives in phase i from the west or from the north. At
the start of phase i, largest-seen(u,i) = nil, where nil is smaller than the identifier of every processor.

For every time t, largest-seen(u,i) at t is no more than largest-seen(u) at t. Also, largest-seen(u,i)
can never be equal to ID(u). Processor u also maintains a variable called eliminated-from(u).
Intuitively, eliminated-from(u) specifies the direction from which the message that eliminated u came.
Initially, eliminated-from(u) is undefined. If u is live at the start of phase i, then largest-seen(u) =
ID(u).

At the start of each phase i > 0, each live processor v sends the message E_v =
"Eliminate_[ID(v), α^i, nil]" to the east. Next, each processor u, whether u is live or dead, processes each
message that u receives. Suppose that u receives the message E_v = "Eliminate_[ID(v), k, ID']" from the
direction dir, where v is some processor, 1 ≤ k ≤ α^i, and dir ∈ {west, north, east, south}. An invariant of
the algorithm is that ID' = nil if dir is west or north; otherwise, ID' is largest-seen(w,i) for some w on
the south boundary of SE(v, α^i). Depending on the value of dir, u executes one of the following four
cases:

Case 1: dir = west:

There are two subcases:

Case 1.1: ID(v) ≠ ID(u):

Then E_v is a message from some processor v different from u.

If largest-seen(u) > ID(v), then u discards E_v. Processor u sets largest-seen(u,i) to
max{largest-seen(u,i), ID(v)}. Also, if k ≠ 1, then u sends "Mark_[ID(v), k-1]" to the east.

If largest-seen(u) ≤ ID(v), then u sets largest-seen(u) to ID(v), u sets eliminated-from(u) to
west, and u is eliminated. Processor u sets largest-seen(u,i) to ID(v). Also, if k ≠ 1, then u sends
"Eliminate_[ID(v), k-1, nil]" to the east. If k = 1, then u is at the northeast corner of SE(v, α^i), and u
sends "Eliminate_[ID(v), α^i, nil]" to the south.

Case 1.2: ID(v) = ID(u):

Then E_v is an Eliminate message from u. Thus E_v originated at u, traversed the entire row Row(u)
containing u, and returned to u. No processor other than u is currently live in Row(u) since E_v was not
discarded. Thus u executes the procedure FINAL defined as follows. Processor u sends the message
F_u = "Final_[ID(u)]" to the north. Each processor v ≠ u with largest-seen(v) < ID(u) passes F_u to the
north. If F_u arrives at some v with ID(u) < largest-seen(v), then v discards F_u. If F_u arrives at u, then
u elects itself as the leader.

Case 2: dir = north:

Then ID(v) ≠ ID(u).

If largest-seen(u) > ID(v), then u sets largest-seen(u,i) to max{largest-seen(u,i), ID(v)}, and u
discards E_v. Also, if k ≠ 1, then u sends "Mark_[ID(v), k-1]" to the south. If k = 1, then u sends
"Mark_[ID(v), α^i]" to the west.

If largest-seen(u) ≤ ID(v), then u sets largest-seen(u) to ID(v), u sets eliminated-from(u) to
north, and u is eliminated. Also, if k ≠ 1, then u sends "Eliminate_[ID(v), k-1, nil]" to the south. If
k = 1, then u sends "Eliminate_[ID(v), α^i, largest-seen(u,i)]" to the west. Finally, u sets
largest-seen(u,i) to ID(v).
Case 3: dir = east:

Then ID(v) ≠ ID(u).

If largest-seen(u) > ID(v), then u discards E_v. Processor u sets largest-seen(u,i) to
max{largest-seen(u,i), ID(v)}. Also, if k ≠ 1, then u sends "Mark_[ID(v), k-1]" to the west. If k = 1,
then u does not send a Mark message.

If largest-seen(u) < ID(v), then u sets largest-seen(u) to ID(v), and u is eliminated. Also, u does
one of the following:

Case 3.1: k ≠ 1:

Processor u sets eliminated-from(u) to east. Also, if ID(u) = ID' and largest-seen(u,i) < ID',
then u sends "Eliminate_[ID(v), k-1, nil]" to the west. Otherwise, u sends "Eliminate_[ID(v), k-1,
max{ID', largest-seen(u,i)}]" to the west. Finally, u sets largest-seen(u,i) to ID(v).

Case 3.2: k = 1:

Processor u is at the southwest corner of SE(v, α^i). If ID(u) = ID', then u sends
"Eliminate_[ID(v), α^i, nil]" to the north, and u sets eliminated-from(u) to east. Processor u sets
largest-seen(u,i) to ID(v).

If ID(u) ≠ ID' and ID' ≤ largest-seen(u,i), then u sends
"Eliminate_[ID(v), α^i, largest-seen(u,i)]" to the north. Also, if eliminated-from(u) = west, then u
sends K_v = "Kill_[largest-seen(u,i)]" to the west. Every processor that receives K_v forwards the
message to the west until K_v reaches the processor x such that ID(x) = largest-seen(u,i). Processor x
becomes eliminated, but eliminated-from(x) is not changed. If eliminated-from(u) ≠ west, then u does
not send a Kill message. Finally, u sets eliminated-from(u) to east so that u initiates at most one Kill
message in phase i. Also, u sets largest-seen(u,i) to ID(v).

If ID(u) ≠ ID' and ID' > largest-seen(u,i), then u sets eliminated-from(u) to east and sends
"Eliminate_[ID(v), α^i, ID']" to the north. Processor u sets largest-seen(u,i) to ID(v).
Case 4: dir = south:

If ID(v) = ID(u), then k = 1 and E_v is a message from u. If u is live when u receives E_v, then u
continues to receive messages for an additional α^i time units, and u advances to phase i+1 if u does not
receive a Kill message during these α^i time units. If u is dead when u receives E_v, then u discards E_v.

If ID(v) ≠ ID(u) and largest-seen(u) > ID(v), then u discards E_v. Processor u sets
largest-seen(u,i) to max{largest-seen(u,i), ID(v)}.

If ID(v) ≠ ID(u) and largest-seen(u) < ID(v), then k ≠ 1, and u is eliminated. Also, u sends
"Eliminate_[ID(v), k-1, ID']" to the north. In addition, if ID' = largest-seen(u) and
eliminated-from(u) = west, then u also sends "Kill_[ID']" to the west. Otherwise, u does not send a
Kill message. Finally, u sets eliminated-from(u) to south so that u initiates at most one Kill message in
phase i. Processor u sets largest-seen(u,i) to ID(v).

Suppose that u receives the message "Mark_[ID(v), k]" from the direction dir, where 1 ≤ k ≤ α^i, v
is some processor, and dir ∈ {west, north, east}. First, u compares largest-seen(u) with ID(v). If
largest-seen(u) < ID(v), then u sets largest-seen(u) to ID(v), and u is eliminated. Next, regardless of
the relative sizes of largest-seen(u) and ID(v), u does one of the following depending on the value of
dir.

Case 1: dir = west:

If k ≠ 1, then u sends "Mark_[ID(v), k-1]" to the east.

If k = 1, then u does not send a Mark message.

Case 2: dir = north:

If k ≠ 1, then u sends "Mark_[ID(v), k-1]" to the south.

If k = 1, then u sends "Mark_[ID(v), α^i]" to the west.

Case 3: dir = east:

If k ≠ 1, then u sends "Mark_[ID(v), k-1]" to the west.

If k = 1, then u does not send a Mark message.

Processor u remains live at the end of phase i if either u receives E_u from the south after 4α^i time units
and does not receive the message "Kill_[ID(u)]," or u receives E_u from the west.
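
The Mark-message rules above are simple enough to write as a single function. The Python sketch below follows the three cases as stated; the function name `on_mark`, the string encoding of directions, and the parameter `alpha_i` (standing for α^i) are hypothetical conventions, not part of the thesis.

```python
from typing import Optional, Tuple


def on_mark(largest_seen: int, live: bool, v_id: int, k: int,
            direction: str, alpha_i: int
            ) -> Tuple[int, bool, Optional[Tuple[str, int]]]:
    """Process "Mark_[ID(v), k]" received from `direction`.

    Returns (new largest-seen(u), whether u is still live, outgoing), where
    outgoing is None or (direction to send to, hop count of the new Mark).
    """
    # First compare largest-seen(u) with ID(v); u is eliminated if smaller.
    if largest_seen < v_id:
        largest_seen, live = v_id, False

    # Then relay, regardless of the outcome of the comparison.
    if direction == "west":
        out = ("east", k - 1) if k != 1 else None
    elif direction == "north":
        out = ("south", k - 1) if k != 1 else ("west", alpha_i)
    elif direction == "east":
        out = ("west", k - 1) if k != 1 else None
    else:
        raise ValueError("Mark messages arrive only from west, north, or east")
    return largest_seen, live, out
```

Note that only a Mark arriving from the north with k = 1 turns a corner (at the southeast corner of the square); in the other two directions the Mark stops when its hop count runs out.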

4.4  Proof  of  Correctness 

Let  z  be  the  processor  with  the  largest  identifier  in  the  network. 

Lemma 4.1: Processor z executes the procedure FINAL (Case 1.2 of the algorithm).

Proof: Processor z does not execute FINAL only if z is eliminated during some phase i ≤ i*. Since z is
the processor with the largest identifier, z cannot be eliminated by receiving an Eliminate message.

Let E_z be an Eliminate message that z sends at the start of some phase i. By our choice of z, every
processor v on the path that E_z traverses has largest-seen(v) ≤ ID(z). Therefore, E_z is never discarded,
and z cannot be eliminated by a discard of E_z.

The only other way that z may be eliminated is if z receives a Kill message K = "Kill_[ID(z)]." By
the algorithm, K was spawned by some processor u ≠ z in the same row Row(z) as z, with
largest-seen(u) = ID(z). Furthermore, u spawned K because u received a message E_v =
"Eliminate_[ID(v), k, ID(z)]" from the direction dir, where 1 ≤ k ≤ α^i, and dir ∈ {east, south}. By the
algorithm, ID(v) ≥ largest-seen(u) = ID(z). Thus, since z has the largest identifier, ID(v) = ID(z), and
E_v was sent by z. But an Eliminate message sent by z cannot reach a processor u in Row(z) from the
east or from the south, unless u = z. Thus K does not exist. □

Theorem 4.1: The algorithm elects z as the leader when the algorithm terminates.

Proof: Processor z executes the procedure FINAL because z receives its Eliminate message from the
west in phase i*. Thus, at the start of FINAL, all the processors v in Row(z) have
largest-seen(v) = ID(z), and all the processors except z in Row(z) are eliminated. Suppose that another
processor u is also live at the start of FINAL. By the algorithm, z and u send Final messages to the north.
Since z has the largest identifier, z's Final message will return to z from the south. Either u's Final
message F_u is discarded before reaching Row(z), or F_u reaches a processor v in Row(z). In the latter
case, F_u is discarded since ID(u) is smaller than largest-seen(v) = ID(z). Thus u will be eliminated.
Only z announces itself as the leader. □

4.5  Message  Complexity 

Lemma 5.1: Let r be a processor. Suppose that some processor receives an Eliminate message E_r from
the north or from the east in phase i. Then all the processors r_0 in SE(r, α^i) with ID(r_0) < ID(r) are
dead at the start of phase i+1.

Proof: If r_0 is dead at the start of phase i, then r_0 is dead at the start of phase i+1. Hence, suppose that
r_0 is live at the start of phase i. Since there is a processor that receives E_r from the north or from the
east in phase i, every processor on the south boundary of SE(r, α^i) receives E_r or M_r. Either E_{r_0} is
discarded before E_{r_0} reaches some processor on the south boundary of SE(r, α^i) from the south, or E_{r_0}
reaches a processor r_1 on the south boundary of SE(r, α^i) from the south. In the latter case, r_1 receives
E_{r_0} after r_1 receives either E_r or M_r. See Figure 4.4. Thus E_{r_0} reaches r_1 when largest-seen(r_1) ≥
ID(r) > ID(r_0), and r_1 discards E_{r_0}. Therefore, r_0 will be dead at the start of phase i+1. □

Lemma 5.2: Let r be a processor live at the start of phase i. Then all the processors r_0 in
NE(r, α^i) ∪ NW(r, α^i) with ID(r_0) < ID(r) are dead at the start of phase i+1.

Proof: If r_0 is dead at the start of phase i, then r_0 is dead at the start of phase i+1. Hence, suppose that
r_0 is live at the start of phase i. Since r is live at the start of phase i, every processor on the north
boundary of SE(r, α^i) receives E_r or M_r.

Suppose that r_0 is in NW(r, α^i). Either E_{r_0} is discarded before E_{r_0} reaches some processor on the
north boundary of SE(r, α^i), or E_{r_0} reaches a processor r_1 on the north boundary of SE(r, α^i). In the
latter case, r_1 receives E_{r_0} after r_1 receives either E_r or M_r. Thus E_{r_0} reaches r_1 when
largest-seen(r_1) ≥ ID(r) > ID(r_0), and r_1 discards E_{r_0}. Therefore, r_0 will be dead at the start of
phase i+1.

Now suppose that r_0 is in NE(r, α^i). Either E_{r_0} is discarded before E_{r_0} reaches some processor on
the north boundary of SE(r, α^i) from the south, or E_{r_0} reaches a processor r_1 on the north boundary of
SE(r, α^i) from the south. In the latter case, r_1 receives E_{r_0} after r_1 receives either E_r or M_r. Thus
E_{r_0} reaches r_1 when largest-seen(r_1) ≥ ID(r) > ID(r_0), and r_1 discards E_{r_0}. Therefore, r_0 will be
dead at the start of phase i+1. □

Figure 4.4. Processor r_0 in SE(r, α^i)

Lemma 5.3: Let r be a processor live at the start of phase i. If there is a processor q in
SE(r, α^i) ∪ SW(r, α^i) live at the start of phase i with ID(q) > ID(r), then r will be dead at the start of
phase i+1.

Proof: Processor r is in NE(q, α^i) ∪ NW(q, α^i). By Lemma 5.2, r is dead at the start of phase i+1. □

Lemma 5.4: Let r be a processor live at the start of phase i, and let q be a processor in NW(r, α^i) with
ID(q) > ID(r). Suppose that a processor receives the Eliminate message E_q from the north or from the
east in phase i. Then r will be dead at the start of phase i+1.

Proof: Processor r is in SE(q, α^i). By Lemma 5.1, r is dead at the start of phase i+1. □

Fix processor u to be live at the start of some phase i+1, where 0 ≤ i ≤ i*-1. Thus, in phase i, all
processors on the east boundary of SE(u, α^i) receive E_u from the north. Thus we can apply
Lemmas 5.1 - 5.4 with r = u. Lemmas 5.5 - 5.10 pertain to u.

Lemma 5.5: All the processors in SE(u, α^i) ∪ NW(u, α^i) - {u} are dead at the start of phase i+1.

Proof: Let u_0 ≠ u be a processor in SE(u, α^i) live at the start of phase i. Since u is live at the start of
phase i+1, ID(u_0) < ID(u) by Lemma 5.3. Thus u_0 is dead at the start of phase i+1 by Lemma 5.1. All
processors in SE(u, α^i) - {u} are, therefore, dead at the start of phase i+1.

Suppose, contrary to the lemma, a processor u_1 ≠ u in NW(u, α^i) is live at the start of phase i+1.
Then, by the argument we have just given, since u is in SE(u_1, α^i), processor u would be dead at the start
of phase i+1, contrary to the hypothesis. Hence all processors in NW(u, α^i) - {u} are dead at the start of
phase i+1. □

We now show that all processors in SW(u, α^i) - {u} are dead at the start of phase i+1. Let x be the
processor at the southwest corner of SE(u, α^i). Since u is live at the start of phase i+1, processor x sends
E'_u = "Eliminate_[ID(u), α^i, ID']" to the north in phase i, where ID' is an identifier of a processor or is
nil. By the algorithm, ID' ≠ ID(u).

Lemma 5.6: If ID' = nil in E'_u, then all processors in SW(u, α^i) - {u} are dead at the start of phase i+1.

Proof: According to the algorithm, processor x sends E'_u with ID' = nil because x received the message
E_u = "Eliminate_[ID(u), l, ID'']," where ID'' is ID(x) or is nil. By the algorithm, x is dead when x sends
E'_u. Now consider all processors w in SW(u, α^i) - {u, x} that are live at the start of phase i. If E_w is
discarded before E_w reaches the south boundary of SE(u, α^i), then w is dead at the start of phase i+1.

Thus suppose that E_w reaches some processor u_0 on the south boundary of SE(u, α^i), and that u_0 does not
discard E_w. Then largest-seen(u_0, i) ≥ ID(w) when E_u reaches u_0. Thus ID'' in E_u is at least ID(w).
Recall that ID'' ∈ {nil, ID(x)}. Since ID'' ≥ ID(w) > nil, ID'' = ID(x). Since w ≠ x, ID(x) > ID(w).
By Lemma 5.2, w is dead at the start of phase i+1. Thus all processors in SW(u, α^i) - {u, x} are also dead
at the start of phase i+1. Hence all the processors in SW(u, α^i) - {u} will be dead at the start of phase
i+1. □

Now suppose that ID' ≠ nil. Thus ID' = ID(v_0), where v_0 is some processor. Since E'_u contains
ID(v_0), v_0 ≠ u. Since ID' ≠ nil, v_0 ≠ x. Note that, by the algorithm, v_0 must be live at the start of phase i
for ID(v_0) to be in E'_u. Lemmas 5.7 - 5.9 pertain to u and v_0.

Lemma 5.7: Processor v_0 is in SW(u, α^i).

Proof: Let u_0 be the western-most processor on the south boundary of SE(u, α^i) with
largest-seen(u_0, i) = ID(v_0) when u's Eliminate message E_u reaches u_0. Since largest-seen(u_0, i) =
ID(v_0), processor u_0 received an Eliminate message E_{v_0} that contained ID(v_0). Therefore, v_0 is on the
boundary of NW(u_0, α^i). Suppose that, contrary to the lemma, v_0 is not in SW(u, α^i). Since v_0 is on the
boundary of NW(u_0, α^i), either v_0 is on the north boundary of SE(u, α^i), or v_0 is on the south boundary
of SE(u, α^i). We will show that both cases lead to contradictions.

If v_0 were on the north boundary of SE(u, α^i), then E_{v_0} would reach u_0 after E_u reaches u_0. Hence
largest-seen(u_0, i) ≠ ID(v_0) when E_u reaches u_0, a contradiction.

If v_0 were on the south boundary of SE(u, α^i), then, by our choice of u_0, u_0 = v_0. By the
algorithm, v_0 would send the message "Eliminate_[ID(u), l, nil]" to the west when v_0 received E_u.
Hence the message E'_u that x sends to the north would not contain ID(v_0), a contradiction. □

As in Figure 4.5, let the rectangle SW(u, q, q') denote a rectangular mesh of processors with
processor u at the northeast corner, q+1 processors on the east and west boundaries, and q'+1 processors
on the north and south boundaries. Let the rectangle NE(u, q, q') denote a rectangular mesh of processors
with processor u at the southwest corner, q+1 processors on the east and west boundaries, and q'+1
processors on the north and south boundaries. Recall that u is live at the start of phase i+1.
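The rectangle definitions can be made concrete with coordinates. The following sketch is an
illustration only and rests on assumptions not fixed by the text: processors are (row, column)
pairs, rows grow southward, columns grow eastward, and the mesh wraps around with a hypothetical
side length given by the parameter side. The text defines SW and NE explicitly; SE and NW are
treated by analogy here. The function name rectangle is likewise hypothetical.

```python
def rectangle(name, u, q, q_prime, side):
    """Return the processors of the rectangle name(u, q, q').

    name is "SW", "NE", "SE", or "NW" and gives the directions in which the
    rectangle extends away from its corner u. The rectangle has q + 1 rows
    (east and west boundaries) and q' + 1 columns (north and south boundaries).
    """
    row, col = u
    d_row = 1 if name[0] == "S" else -1   # south means increasing row index
    d_col = 1 if name[1] == "E" else -1   # east means increasing column index
    return {((row + d_row * i) % side, (col + d_col * j) % side)
            for i in range(q + 1)
            for j in range(q_prime + 1)}


u = (2, 5)
sw = rectangle("SW", u, 2, 3, side=10)   # u sits at the northeast corner
ne = rectangle("NE", u, 2, 3, side=10)   # u sits at the southwest corner
```

With these conventions SW(u, 2, 3) and NE(u, 2, 3) each contain (2+1)(3+1) = 12 processors and
share only u.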

Lemma 5.8: Processor v_0 is not in the rectangle SW(u, α^{i-1}, α^i).

Proof: Processor u is live at the start of phase i. Hence u received an Eliminate message E_u in phase
i-1. Thus all processors b on the east boundary of SW(u, α^{i-1}, α^i) have largest-seen(b) ≥ ID(u) at the
start of phase i. Since u is live at the start of phase i+1 and v_0 ∈ SW(u, α^i), ID(v_0) < ID(u) by Lemma
5.3. Thus if, contrary to the lemma, v_0 were in SW(u, α^{i-1}, α^i), then v_0's Eliminate message E_{v_0} in
phase i would be discarded when E_{v_0} reached the east boundary of SW(u, α^{i-1}, α^i), E_{v_0} would not reach
the south boundary of SE(u, α^i), and ID(v_0) would not be in the message E'_u that x sends to the north, a
contradiction. Hence v_0 is not in SW(u, α^{i-1}, α^i). □

Figure 4.5. The Definitions of NW(u, q, q'), NE(u, q, q'), SE(u, q, q'), and SW(u, q, q')

We will need Lemma 5.9 in the proof of Lemma 5.10.

Lemma 5.9: Let v_w be a processor on the west boundary of SE(u, α^i). Let w_0 be a processor in
SW(u, α^i), and let r ≠ w_0 be a processor distinct from u. Suppose that largest-seen(v_w, i) = ID(w_0) at
some time during phase i. If largest-seen(v_w, i) becomes ID(r), then w_0 is dead by the time phase i+1
starts.

Proof: Since v_w changes largest-seen(v_w, i) from ID(w_0) to ID(r), by the algorithm, v_w receives an
Eliminate message E_r from r such that ID(w_0) < ID(r).

If v_w receives E_r from the north, then w_0 is in SE(r, α^i) since v_w receives E_r after v_w receives w_0's
Eliminate message E_{w_0}. Since ID(w_0) < ID(r), w_0 is dead at the start of phase i+1 by Lemma 5.1.

If v_w receives E_r from the west, then r is in the same row Row(v_w) as v_w. Since v_w receives E_r after
v_w receives E_{w_0}, w_0 is in Row(v_w), and w_0 is to the east of r and to the west of v_w. Thus E_r reaches w_0,
and w_0 will be dead at the start of phase i+1.

Finally, we show that v_w could not have received E_r from the east or from the south. If, to the
contrary, v_w received E_r from the east, then r would be in NW(u, α^i). By Lemma 5.4, since u is live at
the start of phase i+1, ID(r) < ID(u). Hence E_r would be discarded when E_r reached the north boundary
of SE(u, α^i) and before E_r reached v_w. If v_w received E_r from the south, then r would be in the same
column as v_w. Further, r would be in SE(u, α^i) or in NW(u, α^i). In either case, ID(r) < ID(u) by
Lemmas 5.3 and 5.4, since u is live at the start of phase i+1. If r ∈ SE(u, α^i), then E_r would be discarded
when E_r reached the south boundary of SE(u, α^i) and before E_r reached v_w. If r ∈ NW(u, α^i), then E_r
would be discarded when E_r reached the north boundary of NW(u, α^i) and before E_r reached v_w. Thus,
v_w would not set largest-seen(v_w, i) to ID(r), a contradiction. □

Lemma 5.10: If α ≤ 2, then all processors in SW(u, α^i) ∪ NE(u, α^i) - {u} are dead at the start of phase
i+1.

Proof: We first show that all processors in SW(u, α^i) - {u, v_0} are dead at the start of phase i+1. Consider
phase i, and consider all processors w_0 in SW(u, α^i) live at the start of phase i with ID(w_0) > ID(v_0).
We will show that w_0's Eliminate message E_{w_0} in phase i is discarded before or when E_{w_0} reaches the
south boundary of SE(u, α^i), and thus w_0 will be dead at the start of phase i+1. If E_{w_0} were not
discarded before or when E_{w_0} reached the south boundary of SE(u, α^i), then, since ID(w_0) > ID(v_0), the
message E'_u that x sends to the north would not contain ID(v_0); instead, E'_u would contain ID(w_0) or a
larger identifier, a contradiction.

Now consider all processors w_0 in SW(u, α^i) live at the start of phase i with ID(w_0) < ID(v_0). By
Lemma 5.8, v_0 is not in the rectangle SW(u, α^{i-1}, α^i). Since α ≤ 2,

SW(u, α^i) ⊂ NE(v_0, α^i) ∪ NW(v_0, α^i) ∪ SE(v_0, α^{i-1}, α^i) ∪ SW(v_0, α^{i-1}, α^i).

If w_0 ∈ NE(v_0, α^i) ∪ NW(v_0, α^i), then w_0 is dead at the start of phase i+1 by Lemma 5.2.

If w_0 ∈ SE(v_0, α^{i-1}, α^i), then w_0 is dead at the start of phase i+1 by Lemma 5.1.

Now suppose that w_0 ∈ SW(v_0, α^{i-1}, α^i). Since v_0 is live at the start of phase i, processor v_0
received the Eliminate message that v_0 sent in phase i-1. Thus all processors v_1 on the east boundary of
SW(v_0, α^{i-1}, α^i) have largest-seen(v_1) ≥ ID(v_0) > ID(w_0) at the start of phase i. Hence, E_{w_0} will be
discarded if E_{w_0} reaches the east boundary of SW(v_0, α^{i-1}, α^i). Consequently w_0 will be dead at the start
of phase i+1.

We now show that v_0 is dead at the start of phase i+1. Let v_w be the first processor on the west
boundary of SE(u, α^i) to receive E_{v_0}. If E_{v_0} is discarded when E_{v_0} reaches v_w, then v_0 is dead at the
start of phase i+1. Suppose now that v_w does not discard E_{v_0}. When E_{v_0} reached v_w, processor v_w set
largest-seen(v_w, i) to ID(v_0) and eliminated-from(v_w) to west. By the algorithm, v_w changes
eliminated-from(v_w) only if v_w changes largest-seen(v_w, i). By Lemma 5.9, if largest-seen(v_w, i) is
not ID(v_0) when E'_u reaches v_w, then v_0 is dead at the start of phase i+1. Thus suppose that
eliminated-from(v_w) is west and largest-seen(v_w, i) is ID(v_0). By the algorithm, v_w sends a Kill
message to v_0 in phase i. Thus v_0 will be dead at the start of phase i+1.

We have shown so far that all processors in SW(u, α^i) - {u} are dead at the start of phase i+1. If a
processor r ≠ u in NE(u, α^i) were live at the start of phase i+1, then, by the argument we have just given,
since u is in SW(r, α^i), u would be dead at the start of phase i+1, contrary to the hypothesis. Hence all
processors in NE(u, α^i) - {u} are dead at the start of phase i+1. □

We say that u generates k messages in phase i if

(the number of links that u's Eliminate message E_u traverses)
+ (the number of links that a Mark message spawned by E_u traverses)
+ (the number of links that a Kill message spawned by E_u traverses) = k.

Lemma 5.11: The algorithm uses at most 3n messages in phase 0.

Proof: Consider a processor u_0. Suppose that v_0 is the processor immediately to the south of u_0, and w_0
is the processor immediately to the west of u_0. See Figure 4.6. Processor u_0 generates at most five
messages in phase 0: four Eliminate messages and one Kill message. Suppose that u_0 generates four or
five messages. Then u_0 receives u_0's Eliminate message E_{u_0} from the south. If ID(u_0) were less than
ID(w_0), then w_0's Eliminate message E_{w_0} would reach v_0 before E_{u_0}, and v_0 would discard E_{u_0}. Thus
ID(w_0) < ID(u_0). Hence u_0 discards E_{w_0}, and w_0 generates only one message.

Since every processor u_0 that generates at least four messages has a processor w_0 to the west that
generates only one message, each such pair generates at most 5 + 1 = 6 messages, and every other
processor generates at most three messages. Hence the average number of messages per processor in
phase 0 is at most three. □

Figure 4.6. Processors w_0, u_0, and v_0

Theorem 5.1: The algorithm uses at most 229n/18 + o(n) messages.

Proof: Suppose that u is live at the start of phase i, where 1 ≤ i ≤ i*. By Lemmas 5.5 and 5.10 with
α ≤ 2, all the processors in SE(u, α^{i-1}) ∪ SW(u, α^{i-1}) ∪ NE(u, α^{i-1}) ∪ NW(u, α^{i-1}) - {u} are dead at
the start of phase i. Hence there are at most (√n/(α^{i-1}+1))^2 processors u live at the start of phase i.
Processor u generates at most 5α^i messages in phase i, where 1 ≤ i ≤ i*-1. In phase i*, u generates at
most α^{i*} messages. Thus the total number of messages used by the algorithm in phases 0 through i* is
at most

\[ 3n + \sum_{i=1}^{i^*-1} 5\alpha^i \frac{n}{(\alpha^{i-1}+1)^2} + \alpha^{i^*} \frac{n}{(\alpha^{i^*-1}+1)^2}. \]

At most n/(α^{i*-1}+1)^2 processors u execute FINAL, and u generates at most √n messages in FINAL.
Hence the algorithm uses at most n√n/(α^{i*-1}+1)^2 messages in FINAL. The total number NUM of
messages that the algorithm uses satisfies

\[ NUM \le 3n + \sum_{i=1}^{i^*-1} 5\alpha^i \frac{n}{(\alpha^{i-1}+1)^2}
   + \alpha^{i^*} \frac{n}{(\alpha^{i^*-1}+1)^2} + \frac{n\sqrt{n}}{(\alpha^{i^*-1}+1)^2}. \qquad (4.1) \]

Consider each term in inequality (4.1). Since i* = ⌈log_α √n⌉,

\[ \alpha^{i^*} \frac{n}{(\alpha^{i^*-1}+1)^2} + \frac{n\sqrt{n}}{(\alpha^{i^*-1}+1)^2}
   \le 2\alpha^{i^*} \frac{n}{(\alpha^{i^*-1})^2} \le 2\alpha^2\sqrt{n} = O(\sqrt{n}) = o(n), \]

and

\[ \sum_{i=1}^{i^*-1} 5\alpha^i \frac{n}{(\alpha^{i-1}+1)^2}
   = \frac{5n\alpha}{4} + \frac{5n\alpha^2}{(\alpha+1)^2} + \sum_{i=3}^{i^*-1} 5\alpha^i \frac{n}{(\alpha^{i-1}+1)^2} \]
\[ \le \frac{5n\alpha}{4} + \frac{5n\alpha^2}{(\alpha+1)^2} + \sum_{i=3}^{i^*-1} 5\alpha^i \frac{n}{(\alpha^{i-1})^2} \]
\[ \le \frac{5n\alpha}{4} + \frac{5n\alpha^2}{(\alpha+1)^2} + \frac{5n}{\alpha-1} + o(n). \qquad (4.2) \]

The value of α that minimizes expression (4.2), subject to 1 < α ≤ 2, is α = 2, for which the three
leading terms of (4.2) sum to 175n/18. Adding the 3n term of inequality (4.1) gives
NUM ≤ 229n/18 + o(n). □
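The arithmetic behind the constant can be checked mechanically. The sketch below evaluates the
coefficient of n in 3n + 5nα/4 + 5nα²/(α+1)² + 5n/(α-1) with exact rational arithmetic and scans a
grid of values of α in (1, 2]. The function name leading_coefficient and the grid resolution are
illustrative choices, not part of the original analysis, and a grid scan is a sanity check rather
than a proof of minimality.

```python
from fractions import Fraction


def leading_coefficient(alpha):
    """Coefficient of n in 3n + 5n*a/4 + 5n*a^2/(a+1)^2 + 5n/(a-1)."""
    a = Fraction(alpha)
    return 3 + 5 * a / 4 + 5 * a**2 / (a + 1)**2 + 5 / (a - 1)


# Exact value at alpha = 2.
at_two = leading_coefficient(2)

# Scan alpha = 1.01, 1.02, ..., 2.00 and pick the value with the smallest coefficient.
grid = [Fraction(k, 100) for k in range(101, 201)]
best = min(grid, key=leading_coefficient)
```

On this grid the smallest coefficient is attained at α = 2, where it equals 229/18.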


4.6  Time  Complexity 

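Theorem 6.1: The algorithm runs in time O(√n).

Proof: Each phase i lasts 5α^i time units, where 0 ≤ i ≤ i*-1. Phases i* and i*+1 last O(√n) time units.
Since i* = ⌈log_α √n⌉, we have α^{i*} < α√n. Hence, the running time of the algorithm is

\[ O(\sqrt{n}) + \sum_{i=0}^{i^*-1} 5\alpha^i = O(\sqrt{n}) + \frac{5(\alpha^{i^*}-1)}{\alpha-1} = O(\sqrt{n}). \quad \Box \]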
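The geometric sum in this proof can be checked numerically. The sketch below assumes
i* = ⌈log_α √n⌉, computed by a loop to avoid floating-point logarithms, and compares the total
length of phases 0 through i*-1 with the bound 5α√n/(α-1) that follows from α^{i*} < α√n. The
function name phase_time_total and the sample values of n are illustrative assumptions.

```python
import math


def phase_time_total(n, alpha=2.0):
    """Return (i_star, total length of phases 0 .. i_star - 1).

    i_star is the smallest i with alpha**i >= sqrt(n); phase i is taken
    to last 5 * alpha**i time units, as in the proof.
    """
    root = math.sqrt(n)
    i_star = 0
    while alpha ** i_star < root:
        i_star += 1
    total = sum(5 * alpha ** i for i in range(i_star))
    return i_star, total


results = {n: phase_time_total(n) for n in (16, 1024, 10**6)}
```

For α = 2 the bound reads 10√n, so the phases before i* contribute only O(√n) time in these samples.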

4.7  Lower  Bounds 

For  our  lower  bound  proof,  we  generalize  the  techniques  of  Frederickson  and  Lynch  [22]. 

4.7.1  Assumptions 

We now present the assumptions we use to obtain our lower bounds. For convenience, we assume
that each processor is indexed by a unique number chosen from the set {0, 1, ..., n-1}. When we say
"processor p" we mean the "processor with index p." The index of a processor is not necessarily the same
as the processor's identifier. We assume that each message that processor p sends contains p's entire
state. The current state of p incorporates all the messages that p received so far as follows. Recall that p_N
denotes p's north neighbor, p_E denotes p's east neighbor, p_S denotes p's south neighbor, and p_W denotes
p's west neighbor. Let state(p, i) denote the state of processor p at the start of round i. Then
state(p, 0) = (ID(p)); state(p, i) = (s_1, s_E, s_S, s_W, s_N), for each round i ≥ 1, where:

s_1 = state(p, i-1);

s_E = state(p_E, i-1) if p received a message from p_E in round i-1, s_E = nil otherwise;

s_S = state(p_S, i-1) if p received a message from p_S in round i-1, s_S = nil otherwise;

s_W = state(p_W, i-1) if p received a message from p_W in round i-1, s_W = nil otherwise;

s_N = state(p_N, i-1) if p received a message from p_N in round i-1, s_N = nil otherwise.

We say that the two states s and s' are order-equivalent provided that s and s' are structurally
equivalent, and that if two identifiers in s satisfy one of the order relations <, =, or >, then the
corresponding identifiers in s' satisfy the same order relation. An election algorithm is a comparison
algorithm provided that if s and s' are order-equivalent states, then processors with states s and s'
transmit messages in the same direction and have the same election status. The algorithm we presented in
Section 4.3 is a comparison algorithm.
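The state and order-equivalence definitions can be sketched in a few lines. The representation
below is an assumption made for illustration: a state is a nested tuple, a round-0 state is the
one-element tuple (ID,), a later state is the 5-tuple (s_1, s_E, s_S, s_W, s_N), and nil is None.
All function names are hypothetical.

```python
from itertools import combinations


def same_shape(s, t):
    """Structural equivalence: same nesting, with nil (None) in the same places."""
    if isinstance(s, tuple) and isinstance(t, tuple):
        return len(s) == len(t) and all(same_shape(a, b) for a, b in zip(s, t))
    if s is None or t is None:
        return s is None and t is None
    return not isinstance(s, tuple) and not isinstance(t, tuple)


def identifiers(s):
    """Identifiers stored in a state, listed left to right."""
    if s is None:
        return []
    if isinstance(s, tuple):
        return [x for part in s for x in identifiers(part)]
    return [s]


def sign(a, b):
    """Return -1, 0, or 1 according to whether a < b, a = b, or a > b."""
    return (a > b) - (a < b)


def order_equivalent(s, t):
    """True if s and t have the same shape and the same pairwise order relations."""
    if not same_shape(s, t):
        return False
    ids_s, ids_t = identifiers(s), identifiers(t)
    return all(sign(ids_s[i], ids_s[j]) == sign(ids_t[i], ids_t[j])
               for i, j in combinations(range(len(ids_s)), 2))


# Round-1 states of processors that heard from their east and west neighbors only.
s = ((7,), (3,), None, (9,), None)
t = ((20,), (11,), None, (45,), None)
w = ((20,), (30,), None, (45,), None)
```

Here s and t are order-equivalent, while s and w are not, because 7 > 3 but 20 < 30. A comparison
algorithm must therefore behave identically in states s and t.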
4.7.2  Executions

A configuration C of size n is a vector of size n that specifies the state of each processor: the state of
p is in position p in C. A message vector M of size n is a vector of n 4-tuples that specifies the messages
sent by each processor in one round: position p in M specifies the messages that p sent to p's east, south,
west, and north neighbors. The execution of an election algorithm is a sequence of triples (C_1, M, C_2),
where C_1 and C_2 are configurations, and M is a message vector, all of size n. We require all executions e
to satisfy several properties. First, the initial configuration in e specifies the identifier of each processor.
Second, the second configuration in each triple in e must be the same as the first configuration in the next
triple. Third, let state(p, C) denote the state of processor p in configuration C. The 4-tuple in position p
in M must contain the messages that p sends when p is in state(p, C_1). Finally, C_2 must be the
configuration after C_1 when all the processors receive the messages in M. An execution fragment is any
finite prefix of an execution.
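As a rough illustration of these definitions, the sketch below represents an execution as a list of
triples (C_1, M, C_2) and checks two of the stated properties: consecutive triples must share a
configuration, and a fragment must be a prefix. The tuple encoding, the use of None for "no
message," and the function names are assumptions made for this example; the third and fourth
properties depend on the algorithm and are not checked.

```python
def is_chained(execution):
    """Second property: C_2 of each triple equals C_1 of the next triple."""
    return all(prev[2] == nxt[0] for prev, nxt in zip(execution, execution[1:]))


def is_fragment_of(fragment, execution):
    """An execution fragment is a finite prefix of an execution."""
    return list(fragment) == list(execution[:len(fragment)])


n = 4
# One 4-tuple (east, south, west, north) per processor; None means no message.
no_messages = tuple((None, None, None, None) for _ in range(n))

c0 = tuple((ident,) for ident in (5, 2, 8, 3))               # initial states (ID(p))
c1 = tuple((state, None, None, None, None) for state in c0)  # nothing was received
c2 = tuple((state, None, None, None, None) for state in c1)

execution = [(c0, no_messages, c1), (c1, no_messages, c2)]
broken = [(c0, no_messages, c1), (c0, no_messages, c2)]
```

The list execution satisfies the chaining property, while broken does not, because its second
triple does not start from the configuration in which the first triple ended.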

4.7.3  Chains 

Let R(p, q) be a path that connects processors p and q. Then the length of R(p, q) is the number of
processors in R(p, q), including p and q. We use |R(p, q)| to denote the length of R(p, q). Let k ≥ 1 be an
integer. We define p's k-diamond to be the set of all processors q such that there exists a path R(p, q) of
length |R(p, q)| ≤ k. Figure 4.7 shows p's 3-diamond.
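Because the length of a path counts processors rather than links, p's k-diamond consists of the
processors within k-1 links of p. The sketch below enumerates a k-diamond under assumptions made
only for illustration: processors are (row, column) pairs on a mesh with wraparound whose side
length is the hypothetical parameter side, and side exceeds 2(k-1) so that the diamond does not
overlap itself.

```python
def k_diamond(p, k, side):
    """Return p's k-diamond on a side x side mesh with wraparound (assumed)."""
    row, col = p
    reach = k - 1   # a path with at most k processors has at most k - 1 links
    return {((row + d_row) % side, (col + d_col) % side)
            for d_row in range(-reach, reach + 1)
            for d_col in range(-reach, reach + 1)
            if abs(d_row) + abs(d_col) <= reach}


three_diamond = k_diamond((5, 5), 3, side=16)
one_diamond = k_diamond((0, 0), 1, side=16)
```

Under these conventions a 3-diamond contains 1 + 4 + 8 = 13 processors, and a 1-diamond contains
only its center.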

Figure 4.7. Processor p's 3-diamond

Let P be p's k-diamond, and Q be q's k-diamond. We call p the center of P and q the center of Q.
Recall that p_E (respectively p_S, p_W, p_N) is p's east (respectively south, west, north) neighbor. Suppose
that e is an execution or an execution fragment. Then an east-chain in e for (P, Q) is a subsequence
e_{t_1}, e_{t_2}, ..., e_{t_k} of e such that the following three conditions are true:

(1) There exists a processor p' and a path R(p', p) of length k. Path R(p', p) must be of the form
p' p_2 ... p_(k-2) p_E p, for some processors p_2, p_3, ..., p_(k-2).

(2) There exists a processor q' and a path R(q', q) of length k. Path R(q', q) must be of the form
q' q_2 ... q_(k-2) q_E q, for some processors q_2, q_3, ..., q_(k-2).

(3) Let p' be p_1, p_E be p_(k-1), p be p_k, q' be q_1, q_E be q_(k-1), and q be q_k. Let p_0 be some processor
adjacent to p', and q_0 be some processor adjacent to q'. Then for each step e_{t_j} in the chain, a
message is sent either by processor p_(j-1) to processor p_j or by processor q_(j-1) to processor q_j.

Thus an east-chain for (P, Q) describes combined information flow to p and q. We call the chain an
east-chain because p or q receives its information from p_E or q_E, respectively. We use similar definitions
for south-chains, west-chains, and north-chains.


A chain is either an east-chain, a south-chain, a west-chain, or a north-chain. Two k-diamonds P
and Q are order-equivalent provided that if the identifiers of two processors in P satisfy one of the order
relations <, =, or >, then the identifiers of the corresponding processors in Q satisfy the same order
relation. Two processors p and q are k-equivalent provided that the k-diamonds centered at p and q are
order-equivalent. If P and Q are two k-diamonds, then the states s and t are congruent with respect to
(P, Q) provided that s and t are structurally equivalent, and corresponding positions in s and t contain the
identifiers of processors in corresponding positions of P and Q, respectively.
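The notion of k-equivalence can be tested directly on a grid of identifiers. The sketch below is a
hedged illustration: it assumes a square mesh with wraparound stored as a list of rows, and it
takes "corresponding processors" to mean processors at the same offset from the two centers. The
function names and the sample identifier assignment are hypothetical.

```python
from itertools import combinations


def diamond_offsets(k):
    """Offsets (d_row, d_col) of a k-diamond relative to its center."""
    reach = k - 1
    return [(dr, dc)
            for dr in range(-reach, reach + 1)
            for dc in range(-reach, reach + 1)
            if abs(dr) + abs(dc) <= reach]


def k_equivalent(ids, p, q, k):
    """True if the k-diamonds centered at p and q are order-equivalent."""
    side = len(ids)

    def ident(center, offset):
        return ids[(center[0] + offset[0]) % side][(center[1] + offset[1]) % side]

    def sign(a, b):
        return (a > b) - (a < b)

    return all(sign(ident(p, a), ident(p, b)) == sign(ident(q, a), ident(q, b))
               for a, b in combinations(diamond_offsets(k), 2))


# Row-major identifiers on a 6 x 6 mesh: ids[r][c] = 6 * r + c.
ids = [[6 * r + c for c in range(6)] for r in range(6)]
interior_pair = k_equivalent(ids, (2, 2), (3, 3), 2)
wrapped_pair = k_equivalent(ids, (0, 0), (2, 2), 2)
```

With row-major identifiers, two interior processors are 2-equivalent, but a processor whose
2-diamond wraps around the mesh is not 2-equivalent to an interior one, because the wraparound
changes the relative order of its neighbors.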

Lemma 7.1: Let e be an execution fragment of a comparison algorithm. Suppose that k is a positive
integer. Let p and q be any pair of k-equivalent processors, and let P and Q be their respective
k-diamonds. If there are no chains in e for (P, Q), then at the end of e, the states of p and q are congruent
with respect to (P, Q).

Proof: The proof is by induction on the length of e.

Base: |e| = 0. Neither p nor q has received any messages in e, so they will remain in states that are
congruent with respect to (P, Q).

Inductive Step: |e| > 0. Assume as the induction hypothesis that the result holds for any execution
fragment of length shorter than |e| and all values of k. Let e' denote e except for e's last step. Then by
the inductive hypothesis, p and q remain in states that are congruent with respect to (P, Q) up to the end of
e'. Consider what happens at the last step.

Case 1: All of the following hold:

(a) Either p_E and q_E are in states that are congruent with respect to (P, Q) just after e', or else neither p_E
nor q_E sends a message to the west at the last step of e.

(b) Either p_S and q_S are in states that are congruent with respect to (P, Q) just after e', or else neither p_S
nor q_S sends a message to the north at the last step of e.

(c) Either p_W and q_W are in states that are congruent with respect to (P, Q) just after e', or else neither
p_W nor q_W sends a message to the east at the last step of e.

(d) Either p_N and q_N are in states that are congruent with respect to (P, Q) just after e', or else neither p_N
nor q_N sends a message to the south at the last step of e.

In this case, it is easy to see that p and q remain in states that are congruent with respect to (P, Q)
after e. For if p_E and q_E are in states that are congruent with respect to (P, Q) just after e', then, since the
algorithm is a comparison algorithm, they both make the same decision about whether to send a message
to the west at the last step of e. If they both send a message, then the messages they send contain their
respective states, which are congruent with respect to (P, Q). A similar argument applies to p_S and q_S, to
p_W and q_W, and to p_N and q_N. It follows that p and q remain in states that are congruent with respect to
(P, Q) after the last step of e.

Case 2: Processors p_E and q_E are in states that are not congruent with respect to (P, Q) just after e', and at
least one of them sends a message to the west at the last step of e. We will show that this case leads to
contradictions.

If k = 1 (i.e., if P and Q consist only of p and q, respectively), then an east-chain for (P, Q) is
produced by the message sent at the last step, a contradiction. So assume that k > 1. Since p and q are
k-equivalent, it follows that p_E and q_E are (k-1)-equivalent. Let P' and Q' denote their respective
(k-1)-diamonds. Since the states of p_E and q_E just after e' are not congruent with respect to (P, Q), they
are also not congruent with respect to (P', Q'). By the inductive hypothesis, there must be a chain in e' for
(P', Q'). Since at least one of p_E and q_E sends a message to the west at the last step of e, we obtain an
east-chain in e for (P, Q) by appending this step to the chain in e', a contradiction.

Case 3: Processors p_S and q_S are in states that are not congruent with respect to (P, Q) just after e', and at
least one of them sends a message to the north at the last step of e. This case is similar to Case 2 and
also leads to contradictions.

Case 4: Processors p_W and q_W are in states that are not congruent with respect to (P, Q) just after e', and
at least one of them sends a message to the east at the last step of e. This case is similar to Case 2 and
also leads to contradictions.

Case 5: Processors p_N and q_N are in states that are not congruent with respect to (P, Q) just after e', and
at least one of them sends a message to the south at the last step of e. This case is similar to Case 2 and
also leads to contradictions. □

Let max-east(e) be the maximum k for which there are order-equivalent k-diamonds P and Q
(possibly with P = Q) such that e contains an east-chain for (P, Q). The quantities max-south(e),
max-west(e), and max-north(e) are defined analogously. Let sum(e) = max-east(e) + max-south(e)
+ max-west(e) + max-north(e).

Lemma 7.2: Let k be a positive integer. Assume that mesh S is such that, initially, every k-diamond has
at least ik order-equivalent k-diamonds. Let e be any execution fragment of a comparison algorithm in S,
and let e' be another fragment consisting of all but the last step of e. Assume that sum(e') < k. If some
processor p sends a message in the direction dir at the last step of e, then there are at least ik processors
that send a message in the direction dir, where dir ∈ {east, south, west, north}.

Proof: Processor p has at least 4 k-equivalent processors (including p itself). Let q be one of these
processors, and let P and Q be the k-diamonds centered at p and q, respectively. By the definition of
sum(e'), there cannot be a chain in e' for (P, Q). By Lemma 7.1, the states of processors p and q are
congruent with respect to (P, Q) at the end of e'. By the definition of a comparison algorithm, q also sends
a message in the direction dir at the last step of e. □

Lemma 7.2 is also true when max{max-east(e'), max-south(e'), max-west(e'),
max-north(e')} < k.

4.7.4  Replication  symmetry 

We  will  prove  the  lower  bound  on  the  number  of  messages  that  comparison  algorithms  require  for 
election  in  n-meshes  when  n  is  a  power  of  4.  We  first  assign  identifiers  that  have  a  large  amount  of 
replication  symmetry  to  the  processors  in  an  rc-mesh  S. 

Define a k-moddiamond A to be the set of all processors on the boundary of or enclosed by the figure in Figure 4.8. We say that the square abcd (in that order) is the center of A. Consider any n-mesh S, where $n = 4^l$, for some l > 2. Let $A_1$ be a $(2^{l-1} - 1)$-moddiamond in S. Then the remaining processors in S form another $(2^{l-1} - 1)$-moddiamond $A_2$ in S.

We will assign identifiers to the processors as follows:

(1) Assign a 1 (0) to the least significant digit of all the processor identifiers in $A_1$ ($A_2$).

(2) In what follows we consider $A_1$. The same should be done for $A_2$. Create 4 $(2^{l-2} - 1)$-moddiamonds $A_{(1,0)}$, $A_{(1,1)}$, $A_{(1,2)}$, $A_{(1,3)}$ with centers $a_0 b_0 c_0 d_0$, $a_1 b_1 c_1 d_1$, $a_2 b_2 c_2 d_2$, and $a_3 b_3 c_3 d_3$, respectively, where there are

$2^{l-2} + 1$ processors between $a_0$ and $a$ (including $a_0$ and $a$), and $a_0$ is to the north of $a$.

$2^{l-2} + 1$ processors between $a_1$ and $a$ (including $a_1$ and $a$), and $a_1$ is to the east of $a$.

$2^{l-2} + 1$ processors between $a_2$ and $a$ (including $a_2$ and $a$), and $a_2$ is to the south of $a$.

$2^{l-2} + 1$ processors between $a_3$ and $a$ (including $a_3$ and $a$), and $a_3$ is to the west of $a$.

Assign a 0 (respectively 1, 2, 3) to the second least significant digit of processor identifiers in $A_{(1,0)}$ (respectively $A_{(1,1)}$, $A_{(1,2)}$, $A_{(1,3)}$).

(3) Recursively create 16 $(2^{l-3} - 1)$-moddiamonds $A_{(1,0,0)}$, $A_{(1,0,1)}$, $A_{(1,0,2)}$, $A_{(1,0,3)}$, $A_{(1,1,0)}$, ..., $A_{(1,3,3)}$. For each digit x in {0, 1, 2, 3}, assign a 0 (respectively 1, 2, 3) to the third least significant digit of processor identifiers in $A_{(1,x,0)}$ (respectively $A_{(1,x,1)}$, $A_{(1,x,2)}$, $A_{(1,x,3)}$).

(4) Let $i \ge 4$. Recursively create $4^{i-1}$ $(2^{l-i} - 1)$-moddiamonds and assign a 0, 1, 2, or 3 to the ith least significant digit of processor identifiers. The recursion stops after creating 1-moddiamonds and assigning the appropriate digits to the processor identifiers in the 1-moddiamonds. We then assign the most significant digits to each processor in the 1-moddiamonds as in Figure 4.9. Figure 4.10 shows the processor identifiers in a 64-mesh.
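
The bookkeeping of steps (1) through (4) can be tabulated with a small sketch. It only counts moddiamonds per level inside one of the two halves and does not reproduce the geometry of Figure 4.8 or the digits of Figure 4.9. The function name `moddiamond_levels` and the tuple layout are illustrative assumptions, not part of the construction itself.

```python
def moddiamond_levels(l):
    """List (level i, moddiamonds per half, parameter k) for an n-mesh, n = 4**l.

    Level i creates 4**(i-1) moddiamonds of parameter 2**(l-i) - 1 inside each
    of the two halves, and the recursion stops once the parameter reaches 1.
    Hypothetical helper for illustration only.
    """
    if l < 2:
        raise ValueError("the loop needs l >= 2 to reach 1-moddiamonds")
    levels = []
    i = 1
    while True:
        k = 2 ** (l - i) - 1            # parameter of the moddiamonds at level i
        levels.append((i, 4 ** (i - 1), k))
        if k == 1:                      # 1-moddiamonds: recursion stops here
            break
        i += 1
    return levels


# For the 64-mesh of Figure 4.10 (l = 3) the recursion has two levels.
print(moddiamond_levels(3))  # [(1, 1, 3), (2, 4, 1)]
```

Each level fixes one more low-order digit, so in this sketch the number of levels equals the number of identifier digits assigned before the most significant digits are filled in.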

4.7.5  Proof  of  the  lower  bound 

Theorem 7.1: Assume that $n = 4^l$ and l > 2. Consider any comparison algorithm that elects a leader in every synchronous n-mesh. Then the algorithm has an execution e that uses at least — n messages.

Proof: Consider the n-mesh S whose processor identifiers are assigned as described in Section 4.7.4. Let messages(e) denote the number of messages that e uses in S. We will prove the theorem by bounding messages(e) from below in four steps.

Figure 4.9. Processor Identifiers in a 1-moddiamond

First, we show that $\mathit{messages}(e) \ge n$. Initially, no processor knows the identifier of any other processor. Thus the state of every processor p is order-equivalent to the state of every other processor. Since the algorithm is a comparison algorithm, every processor sends a message to the east (respectively south, west, north) if, and only if, p sends a message to the east (respectively south, west, north). Thus $\mathit{messages}(e) \ge n$.

Second, we show that $\mathit{messages}(e) \ge n + n/2$. Let 0 be the first round of the algorithm in which p sends a message. In round 0, if p sends more than one message, then every processor sends more than one message, and so $\mathit{messages}(e) \ge 2n$. Thus suppose that p sends only one message in round 0. Without loss of generality, suppose that p (hence every processor) sends the message to the east. Recall that $p_w$ is p's west neighbor, that state(p,i) is the state of p at the start of round i, and that the state of a processor at the end of round i is the same as the state of the processor at the start of round i+1. Then state(p,1) = (ID(p), nil, nil, ID($p_w$), nil). At the end of round 0, there are two types of states. The state(p,1) is of the first type if ID($p_w$) < ID(p). The state(p,1) is of the second type if ID($p_w$) > ID(p). Suppose that state(p,1) is of the first type, and state(q,1) is of the second type. The state of every processor at the end of round 0 is order-equivalent to the state of either p or q. By the way we assigned identifiers to processors, it suffices to compare the most significant digits of the identifiers to determine whether ID($p_w$) < ID(p). Thus there are $n/2$ processors whose states are order-equivalent to state(p,1), and there are $n/2$ processors whose states are order-equivalent to state(q,1). Hence the algorithm cannot terminate at the end of round 0. Let $i_1$ be the first round, after round 0, in which a processor sends a message. Then there will be at least $n/2$ messages sent in $i_1$ in the same direction. Thus $\mathit{messages}(e) \ge n + n/2$.
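
The two types of states after round 0 can be illustrated with a small sketch. It assumes the usual reading of order-equivalence for comparison algorithms: two states are order-equivalent when nil occupies the same positions and every pair of identifiers compares the same way in both. The names `order_equivalent` and `state_after_round0`, and the use of `None` for nil, are illustrative assumptions.

```python
from itertools import combinations


def _sign(x, y):
    """Return -1, 0, or 1 according to how x compares with y."""
    return (x > y) - (x < y)


def order_equivalent(s, t):
    """Check order-equivalence of two states given as tuples (None stands for nil)."""
    if len(s) != len(t):
        return False
    # nil must occupy exactly the same positions in both states
    if any((a is None) != (b is None) for a, b in zip(s, t)):
        return False
    filled = [i for i, a in enumerate(s) if a is not None]
    # every pair of identifiers must compare the same way in both states
    return all(_sign(s[i], s[j]) == _sign(t[i], t[j])
               for i, j in combinations(filled, 2))


def state_after_round0(own_id, west_id):
    """State (ID(p), nil, nil, ID(p_w), nil) after every processor sends east."""
    return (own_id, None, None, west_id, None)


first_type = state_after_round0(7, 3)     # ID(p_w) < ID(p)
also_first = state_after_round0(20, 11)   # ID(p_w) < ID(p)
second_type = state_after_round0(5, 9)    # ID(p_w) > ID(p)

print(order_equivalent(first_type, also_first))   # True
print(order_equivalent(first_type, second_type))  # False
```

A comparison algorithm cannot tell `first_type` from `also_first`, which is why all processors of one type act identically in the next round.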

Third, we show that $\mathit{messages}(e) \ge n + n/2 + n/4$. If both p and q send messages in round $i_1$, then all the processors send messages in round $i_1$, and so $\mathit{messages}(e) \ge 2n$. If either p or q sends more than one message in round $i_1$, then at least $n/2$ processors send more than one message in round $i_1$, and so $\mathit{messages}(e) \ge 2n$. Thus consider what happens when either q or p sends only one message in round $i_1$. We will show that the state of every processor at the end of round $i_1$ is order-equivalent to the states of at least $n/4$ processors.

Case 1: Suppose that p sends only one message in the direction dir, and q does not send any message, where dir ∈ {east, south, west, north}. There are three subcases:

Case 1.1: dir = east. Then state(p, $i_1$+1) = state(p,1). Thus there are $n/2$ processors whose states are order-equivalent to state(p, $i_1$+1), namely, the processors whose states were congruent to state(p,1) at the beginning of round 1. On the other hand, by the way identifiers are assigned to processors, it suffices to compare the most significant digits of the identifiers to show that state(q, $i_1$+1) is order-equivalent to the states of $n/4$ processors. Since the state of every processor at the beginning of round 1 is order-equivalent to either state(p,1) or state(q,1), the state of every processor is order-equivalent to the state of at least $n/4$ processors at the end of round $i_1$.

Case 1.2: dir = south. Then state(q, $i_1$+1) = state(q,1). Thus there are $n/2$ processors whose states are order-equivalent to state(q, $i_1$+1), namely those processors whose states were congruent to state(q,1) at the beginning of round 1. On the other hand, by the way identifiers are assigned to processors, it suffices to compare the most significant digits of the identifiers to show that state(p, $i_1$+1) is order-equivalent to the states of $n/4$ processors. Thus the state of every processor is order-equivalent to the state of at least $n/4$ processors at the end of round $i_1$.

Case 1.3: dir = west or dir = north. By the way identifiers are assigned to processors, it suffices to compare the most significant digits of the identifiers to show that state(p, $i_1$+1) is order-equivalent to the states of $n/4$ processors, and state(q, $i_1$+1) is order-equivalent to the states of $n/4$ processors. Thus the state of every processor is order-equivalent to the state of at least $n/4$ processors at the end of round $i_1$.

Case 2: Suppose that q sends only one message in the direction dir, and p does not send any message, where dir ∈ {east, south, west, north}. This case is similar to Case 1.

Let $i_2$ be the first round, after round $i_1$, in which a processor sends a message. Since the state of every processor is order-equivalent to the state of at least $n/4$ processors at the end of round $i_1$, there will be at least $n/4$ messages sent in $i_2$ in the same direction. Hence $\mathit{messages}(e) \ge n + n/2 + n/4$.

Finally, we bound the number of messages sent in one further round. As we explained, if a processor sends more than one message in rounds 0, $i_1$, or $i_2$, then $\mathit{messages}(e) \ge 2n$. Thus suppose that all processors send only one message in each of these rounds. Let $i_3$ be the first round, after round $i_2$, in which a processor sends a message. Let e' be the execution fragment consisting of all the rounds up to, and including, round $i_3 - 1$. By the definitions of sum(e'), $i_1$, $i_2$, and $i_3$, sum(e') < 3. By the way identifiers are assigned to processors, it suffices to compare the two most significant digits of the identifiers to show that every 4-diamond has — order-equivalent 4-diamonds. By Lemma 7.2, at least that many messages are sent in round $i_3$, in addition to the $n + n/2 + n/4$ messages counted above. □
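
The counting pattern of the proof can be sketched numerically: each round in which messages are sent contributes at least the size of the smallest order-equivalence class at that point. The sketch below accumulates only the contributions for rounds 0, $i_1$, and $i_2$ (all of n, then n/2, then n/4). The function name `staged_lower_bound` and its inputs are illustrative assumptions, and the contribution of round $i_3$ is deliberately left out.

```python
from fractions import Fraction


def staged_lower_bound(n, class_fractions):
    """Return the running lower bound on messages after each counted round.

    class_fractions[j] is the fraction of the n processors guaranteed to send
    a message in the j-th counted round. Hypothetical helper for illustration.
    """
    total = Fraction(0)
    partial = []
    for f in class_fractions:
        total += Fraction(f) * n   # at least f * n messages in this round
        partial.append(total)
    return partial


# Rounds 0, i_1, i_2 in a 64-mesh: n, then n + n/2, then n + n/2 + n/4.
bounds = staged_lower_bound(64, [1, Fraction(1, 2), Fraction(1, 4)])
print([int(b) for b in bounds])  # [64, 96, 112]
```

Exact rational arithmetic is used so that the partial sums stay exact for any n.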

In  the  proof  of  Theorem  7.1,  we  used  only  the  two  most  significant  digits  of  the  processor 
identifiers.  In  the  future,  we  hope  to  use  all  of  the  digits  to  yield  a  better  lower  bound. 
CHAPTER 5

FUTURE WORK

In  this  chapter,  I  present  some  of  the  problems  I  hope  to  solve  in  the  future. 

5.1  Problems  Associated  with  Chapter  2 

An  important  problem  is  to  generalize  the  results  of  the  paper  "Memory  requirements  for  agreement 
among  unreliable  asynchronous  processes"  [2]  to  the  case  when  a  memory  cell’s  value  spontaneously 
oscillates  between  several  values.  There  are  two  reasons  to  study  memory  oscillation.  First,  memory 
oscillation  models  the  situation  where  some  faulty  cells  are  repaired  on-line  while  some  processes  may  be 
accessing  them.  Requiring  such  processes  to  abort  their  executions  and  restart  their  algorithms  would  be 
inefficient  and  inconvenient.  Second,  by  examining  memory  oscillation,  I  wish  to  study  whether  cheap 
but possibly faulty memory can be used in place of expensive but reliable memory without significant
performance  degradation.  Therefore,  I  plan  to  formulate  lower  and  upper  bounds  on  the  number  of 
three-valued  cells  needed  in  test-and-set  agreement  protocols  when  processes  die  and  memory  cells  fail 
undetectably.  Elaborate  coding  schemes  do  not  immediately  solve  this  problem  since  the  processes 
cannot  set  more  than  one  cell  at  a  time,  and  the  processes  may  die  at  any  step.  For  the  lower  bounds,  I 
will  generalize  the  concept  of  computation  graphs  [2]  to  handle  memory  failure.  A  faulty  cell  will  be 
modeled  as  a  reliable  cell  that  a  Byzantine  process  p  spontaneously  writes.  Process  p  does  not  access  any 
other  cell.  For  the  upper  bounds,  I  intend  to  build  on  Jaffe’s  techniques  [31]. 

5.2 Problems Associated with Chapter 3

Consider  a  general  network  in  which  each  edge  has  a  weight  that  corresponds  to  the  cost  of  sending 
a  message  along  the  edge.  Minimum  weight  spanning  trees  of  the  network  are  used  for  broadcasting  in 
computer  networks  with  point-to-point  links.  Tsin  [48]  presented  a  distributed  algorithm  for  updating  a 
minimum  spanning  tree  when  a  new  vertex  is  added  to  the  underlying  graph.  I  hope  to  study  the  problem 
of  constructing  a  minimum-weight  spanning  tree  on  the  network  when  some  of  the  network  links  fail 
intermittently. Building on the work of Korach, Moran, and Zaks [33], I will first consider the simpler
problem of constructing spanning trees (not necessarily minimum-weight) on general networks with
faulty  links.  An  algorithm  that  solves  the  simpler  problem  will  be  an  algorithm  for  election  in  general 
networks  with  faulty  links,  since  the  root  of  a  spanning  tree  will  be  the  leader  of  the  network.  Next,  by 
using  the  techniques  of  Frederickson  [21],  Gallager,  Humblet,  and  Spira  [24],  and  Tsin  [48],  I  expect  to 
generalize  the  solution  of  the  simpler  problem  to  the  problem  of  constructing  minimum-weight  spanning 
trees  when  some  of  the  network  links  may  be  faulty. 

Deadlock  detection  is  a  fundamental  problem  in  distributed  databases  and  distributed  operating 
systems. Deadlock occurs when a cycle of n processes $P_0, P_1, \ldots, P_{n-1}, P_0$ forms such that each $P_i$ waits for a resource held by $P_{(i+1) \bmod n}$. I plan to construct deadlock detection algorithms for asynchronous complete networks with Byzantine links. Complete networks in this case may be either physical networks or virtual networks at the session layer of the OSI model of computer networks. I expect to use the techniques of Chandy, Misra, and Haas [10] and Awerbuch and Micali [6].
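
The cycle condition in the definition of deadlock can be made concrete with a small centralized sketch. It is not one of the distributed algorithms proposed above and ignores Byzantine links: it only checks a given wait-for relation for a cycle, assuming each process waits for at most one other process. The function name `find_wait_cycle` and the dictionary representation are illustrative assumptions.

```python
def find_wait_cycle(waits_for):
    """Return a list of processes forming a wait-for cycle, or None.

    waits_for maps each process to the process holding the resource it waits
    for, or to None if it is not waiting. Hypothetical helper for illustration.
    """
    for start in waits_for:
        position = {}          # process -> index along the walk from start
        path = []
        node = start
        while node is not None and node not in position:
            position[node] = len(path)
            path.append(node)
            node = waits_for.get(node)   # follow the single wait-for edge
        if node is not None:             # walked back onto the path: a cycle
            return path[position[node]:]
    return None


deadlocked = {"P0": "P1", "P1": "P2", "P2": "P0", "P3": None}
print(find_wait_cycle(deadlocked))                # ['P0', 'P1', 'P2']
print(find_wait_cycle({"A": "B", "B": None}))     # None
```

Following a single outgoing edge per process keeps the check simple. A process waiting on several resources would need a general graph search instead.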

5.3 Problems Associated with Chapter 4

Peterson [41] proposed an efficient algorithm for election in reliable asynchronous meshes.
Although the algorithm uses O(n) messages, which is asymptotically optimal, the constant in the big-O
notation is large. Since the number of messages that an algorithm uses is independent of machine
implementations,  it  is  desirable  to  have  a  small  constant.  I  plan  to  improve  the  lower  and  upper  bounds 
on  the  number  of  messages  for  synchronous  meshes  as  a  first  step  in  deriving  a  lower  bound  and  a 
matching  upper  bound  on  the  number  of  messages  in  asynchronous  meshes.  The  analysis  of  the  lower 
bound  given  in  Chapter  4  can  be  improved  in  two  ways.  First,  I  hope  to  have  a  better  analysis  of  the 
number  of  messages  when  the  processor  identifiers  are  distributed  as  in  Section  4.7.4.  Second,  I  plan  to 
obtain  a  distribution  of  processor  identifiers  that  yields  a  better  lower  bound.  On  the  other  hand,  the 
upper  bound  in  Chapter  4  can  be  improved  by  using  the  solutions  to  the  firing  squad 
problem  [26],  [38],  [50]  to  efficiently  remove  the  requirement  that  all  the  processors  start  the  algorithm 
simultaneously. 